Bayesian semi-parametric methods for non-ignorable dropout

(1)

BAYESIAN SEMI-PARAMETRIC METHODS FOR NON-IGNORABLE DROPOUT by

CAMILLE M. MOORE B.A., Cornell University, 2003 M.A., New York University, 2007 M.S., University of Colorado, 2013

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment

of the requirements for the degree of Doctor of Philosophy

Biostatistics Program 2016

(2)

This thesis for the Doctor of Philosophy degree by Camille M. Moore

has been approved for the Biostatistics Program

by

Gary Grunwald, Chair Samantha MaWhinney, Advisor

Nichole E. Carlson Elizabeth Connick

Ed Bedrick

(3)

Moore, Camille M. (Ph.D., Biostatistics)

Bayesian Semi-Parametric Methods for Non-Ignorable Dropout Thesis directed by Professor Samantha MaWhinney

ABSTRACT

Dropout is a common problem in longitudinal studies; when the probability of dropout depends on unobserved outcomes, even after conditioning on available data, data may be missing not at random, and dropout is not ignorable. Selection, frailty, and mixture model frameworks have been proposed to adjust for dropout; however, most methods rely on parametric assumptions about the distribution of dropout times and the relationship between the outcome and dropout time. We propose two Bayesian semi-parametric frameworks to account for dropout, including a natural cubic B-spline varying coefficient model and a Dirichlet process mixture model. These methods can model longitudinal Gaussian outcomes as well as non-normal outcomes in the exponential family, while relaxing common parametric assumptions and making inference more robust. These methods are extended to the joint modeling of two correlated longitudinal outcomes in the presence of dropout. We apply these methods to investigate longitudinal changes in CD4+ _{T cell count, viral load, and viral load suppression in two HIV cohort studies.}

The form and content of this abstract are approved. I recommend its publication. Approved: Samantha MaWhinney

(4)

ACKNOWLEDGEMENTS

I would like to express my deep appreciation and gratitude to Sam MaWhinney and Nichole Carlson, my co-advisors, who have supported and guided me through my entire graduate school experience, from my masters thesis work to writing my first NIH grant and completing this dissertation research. They have been truly generous with their time. I would also like to thank Jeri Forster, Gary Grunwald, Katerina Kechris and Liz Connick for their thoughtful reviews of my grant and dissertation proposal and Sarah Kreidler for translating this work into an R package, which will help make these methods accessible to others. This thesis would not have been possible without the generous support of the National Institute on Drug Abuse, grant number DA037778. I would like to thank my program officer, Peter Hartsock, for taking an interest in my work and supporting research in statistical methods and mathematical modeling.

(5)

TABLE OF CONTENTS CHAPTER

I. INTRODUCTION . . . 1

II. A BAYESIAN NATURAL CUBIC B SPLINE VARYING COEFFICIENT METHOD FOR NON-IGNORABLE DROPOUT . . . 4

II.1 Summary . . . 4

II.2 Introduction . . . 4

II.3 Bayesian Varying Coefficient Models for Non-Ignorable Dropout . . . 8

II.3.1 Model for Exponential Family Outcomes with Dropout-Varying Coefficients . . . 9

II.3.2 Natural Cubic B-Splines . . . 9

II.3.3 Dropout Time Model and Bayesian Bootstrapping . . . 10

II.3.4 The Calculation of Marginal Effects . . . 10

II.3.5 Priors . . . 11

II.4 Estimation . . . 13

II.4.1 Reversible Jump Markov Chain Monte Carlo . . . 13

II.4.2 Implementation Issues . . . 14

II.5 Simulation Study . . . 15

II.5.1 Models . . . 15

II.5.2 Methods of Evaluation . . . 16

II.5.3 Implementation . . . 16

II.5.4 Results . . . 17

II.6 Applications to the WIHS . . . 17

II.6.1 Methods . . . 18

II.6.2 Results . . . 19

II.6.3 Sensitivity Analyses . . . 20

(6)

II.7 Discussion . . . 21

III. A DIRICHLET PROCESS MIXTURE MODEL FOR NON-IGNORABLE DROPOUT . . . 26

III.1 Summary . . . 26

III.2 Introduction . . . 26

III.3 Methods . . . 29

III.3.1 Model . . . 30

III.3.2 Priors and Hyperpriors . . . 32

III.3.3 Estimation . . . 32

III.3.4 Calculation of Marginal and Dropout Time Specific Effects . . . . 34

III.3.5 Sensitivity to Priors . . . 34

III.4 Simulation . . . 35

III.4.1 Methods of Evaluation . . . 36

III.4.2 Implementation . . . 36 III.4.3 Results . . . 37 III.5 Application . . . 37 III.5.1 Methods . . . 38 III.5.2 Results . . . 39 III.6 Discussion . . . 40

IV. SEMIPARAMETRIC BAYESIAN METHODS FOR THE JOINT MODELING OF TWO OUTCOMES IN THE PRESENCE OF NON-IGNORABLE DROPOUT . . . 46

IV.1 Summary . . . 46

IV.2 Introduction . . . 46

IV.3 Methods . . . 48

(7)

IV.3.2 Bivariate DP-Drop . . . 51

IV.4 Simulation Study . . . 55

IV.4.1 Methods of Evaluation . . . 55

IV.4.2 Implementation . . . 55

IV.4.3 Results . . . 56

IV.5 Application to the WIHS . . . 57

IV.5.1 Methods . . . 58 IV.5.2 Results . . . 59 IV.6 Discussion . . . 60 V. CONCLUSION . . . 63 REFERENCES . . . 66 APPENDIX A. MISCELLANEA FOR CHAPTER II . . . 74

A.1 Reversible Jump MCMC Algorithm for the BNSV . . . 74

A.1.1 Dimension Change . . . 74

A.1.2 Move a Knot . . . 78

A.1.3 Update Fixed Effect Coefficients . . . 79

A.1.4 Update Random Effects . . . 80

A.1.5 Update Variance Components . . . 81

A.2 Additional Simulation Studies . . . 82

A.3 Sensitivity Analyses . . . 84

A.3.1 CD4+ _{Decline . . . .} ₈₄

A.3.2 Viral Load Suppression . . . 86

A.4 Comparison of the BNSV to the CLM . . . 87

A.4.1 Methods of Evaluation . . . 88

(8)

A.4.2 Implementation . . . 88

A.4.3 Results . . . 89

B. MISCELLANEA FOR CHAPTER III . . . 97

B.1 MCMC Algorithm for the DP-Drop . . . 97

B.2 Additional Simulation Studies . . . 99

B.3 Sensitivity Analyses . . . 101

C. MISCELLANEA FOR CHAPTER IV . . . 104

C.1 Reversible Jump MCMC Algorithm for the Bivariate BNSV . . . 104

(9)

LIST OF TABLES TABLE

II.1 Simulation Study Comparing BNSV and GLMM Methods: Posterior Mean

Marginal Slope Estimates, Bias, Variance and Mean Squared Error . . . 22

II.2 Demographic Characteristics of Untreated Subjects with HIV Disease in the WIHS by Drug Use Group . . . 23

II.3 Estimated Changes in ln(CD4+_{) and ln(Odds of Viral Load Suppression) per}

Year for Untreated Subjects in the WIHS Using the BNSV and GLMM

Methods . . . 23

III.1 Simulation Study Comparing the DP-Drop, Simple Frailty Model, DP-GLMM, and GLMM Methods: Posterior Mean Marginal Slope

Estimates, Bias, Variance and Mean Squared Error . . . 41

III.2 Demographic Characteristics of Untreated Subjects with HIV Disease in the MACS by Drug Use Group . . . 43

III.3 Estimated Change in ln(CD4+_{) per year for Untreated Subjects in the}

MACS Using the DP-Drop and LMM Methods . . . 44

IV.1 Simulation Study Comparing Univariate and Bivariate BNSV, DP-Drop and LMM Methods: Posterior Mean Marginal Slope Estimates, Bias, Variance

and Mean Squared Error . . . 60

IV.2 Demographic Characteristics of Untreated Consistent Hard Drug Users with HIV Disease in the WIHS . . . 62

IV.3 Univariate and Bivariate BNSV, DP-Drop and LMM Posterior Mean and 95% Credible Interval (CI) Estimates of the Marginal Intercepts and Slopes for ln(CD4+_{) and log}

10(Viral Load) in Untreated Hard Drug Users in the

WIHS . . . 62

IV.4 Bivariate BNSV, DP-Drop and LMM Posterior Mean and 95% Credible Interval (CI) Estimates of the Correlations Between ln(CD4+_{) and}

log10(Viral Load) Intercepts and Slopes for Untreated Hard Drug Users in

the WIHS . . . 62

A.1 Additional Simulation Study Comparing the BNSV and GLMM Methods:

Linear and Constant Dropout Varying Slopes . . . 82

A.2 Additional Simulation Study Comparing the BNSV and GLMM Methods: Posterior Mean Estimates, Bias, Variance and Mean Squared Error for the

Marginal Slope . . . 83

(10)

A.3 Sensitivity Analyses: Estimated Ln(CD4+_{) for Untreated Subjects in the}

WIHS for a Range of δ. . . 85

A.4 Sensitivity Analyses: Estimated Ln(Odds of Suppression) for Treated

Subjects in the WIHS for a Range of δ. . . 86

A.5 Simulation Study Comparing the BNSV and CLM: Simulated Dropout

Varying Slopes . . . 88

A.6 Simulation Study Comparing the BNSV and CLM: Posterior Mean

Estimates of the Marginal Slope, Bias, Variance, and Mean Squared Error . 91

B.1 Additional Simulation Study Comparing the DP-Drop, Simple Frailty, DP-GLMM and GLMM Methods: Simulated Linear and Constant Dropout

Varying Slopes . . . 100

B.2 Additional Simulation Study Comparing the DP-Drop, Simple Frailty, DP-GLMM and GLMM Methods: Posterior Mean Marginal Slope

Estimates, Bias, Variance and Mean Squared Error . . . 100

B.3 Sensitivity Analyses: Estimated Ln(CD4+_{) for Untreated Subjects in the}

(11)

LIST OF FIGURES FIGURE

II.1 Simulation Study of the BNSV Method: Posterior Mean Estimates of the

Dropout-Varying Slope . . . 22

II.2 BNSV and LMM Posterior Mean CD4+ _{Count and 95% Credible Interval}

(CI) over Time in Untreated Subjects in the WIHS . . . 24

II.3 BNSV and GLMM Posterior Mean Probability of Suppression and 95% Credible Interval (CI) over Time in a Subject that Initiated HAART in the

WIHS . . . 25

III.1 Simulation Study Comparing the DP-Drop and Simple Frailty Model

Methods: Posterior Mean Estimates of the Dropout Varying Slope . . . 42

III.2 Simulation Study Comparing the DP-Drop, Simple Frailty Model, DP-GLMM, and GLMM Methods: Estimated Densities and Actual

Distribution of the subject-specific Slope for a Single Simulated Dataset . . . 43

III.3 DP-Drop and LMM Posterior Mean CD4+ _{Count and 95% Credible Interval}

(CI) over Time in Untreated Subjects in the MACS . . . 44

III.4 DP-Drop and LMM Posterior Mean Density of the subject-specific CD4+

Count Intercept and Slope for Untreated Subjects in the MACS . . . 45

IV.1 Simulation Study of the Bivariate BNSV and DP-Drop Methods: Posterior

Mean Estimates of the Dropout-Varying Slope for Outcome 1 . . . 61

IV.2 Simulation Study of the Bivariate BNSV and DP-Drop Methods: Posterior

Mean Estimates of the Dropout-Varying Slope for Outcome 2 . . . 61

A.1 Additional Simulation Study of the BNSV: Dropout Varying Slopes . . . 84

A.2 Sensitivity Analysis: Estimated CD4+ _{Over Time in Untreated Subjects in}

the WIHS Assuming a Proportional Attenuation of the Slope after Dropout 85

A.3 Sensitivity Analysis: Estimated Probability of Viral Load Suppression Over Time for Treated Subjects in the WIHS Assuming a Proportional

Attenuation of the Slope after Dropout . . . 87

A.4 Simulation Study Comparing the BNSV and CLM: Dropout Varying Slope

for Form (i), Smooth Continuous Function . . . 92

for Form (ii), Step Function . . . 93

(12)

for Form (iii), Linear Function . . . 94

for Form (iv), No Dropout Effect . . . 95

for Form (v), Smooth Continuous Function, Slight Dropout Effect . . . 96

B.1 Sensitivity Analysis: Estimated CD4+ _{Over Time in Untreated Subjects in}

(13)

CHAPTER I INTRODUCTION

Dropout is common in longitudinal studies, and when the probability of dropout depends on unobserved outcomes, even after conditioning on available data, data may be missing not at random, and dropout is not ignorable. Traditional longitudinal data analysis methods, such as mixed-effects models, do not account for non-ignorable dropout, and can potentially lead to biased results (Fairclough, 2002). For example, several lines of evidence suggest that HIV+ individuals with poor clinical outcomes may be more likely to drop out (Lanoya et al., 2006; Hessol et al., 2009, 2001) and to be removed from analyses (Hogan et al.,2004b,a; Liu et al., 2006). These subjects may also be more likely to use illicit drugs or alcohol (Lanoya et al., 2006), which have been associated with more rapid disease progression (Lucas et al.,2002;Samet et al., 2003;Theall et al.,

2007;Chander et al.,2006). Over time, cohorts may evolve to be biased towards subjects with better outcomes and healthier lifestyles such that estimates are overly optimistic. Thus, analyses must account for dropout; however, naive methods such as mixed-effects models (Laird and Ware, 1982) are frequently used, in part since investigators may be unaware of possible biases (Little and Rubin, 2002) and loss of power (Wu and Carroll,

1988). In addition to HIV cohort studies, similar dropout related challenges have been identified in quality of life data from clinical trials of cancer therapies (Fairclough et al.,

1998), anti-depressant clinical trials (Molenberghs et al., 2004), and studies of smoking cessation programs (Hogan et al., 2004b), among others.

Selection (Heckman, 1979; Diggle et al., 1994; Heckman, 1998), frailty (Follmann and Wu, 1995; DeGruttola and Tu, 1994; Albert and Follmann, 2000; Schluchter, 1992;

Lancaster and Intrator, 1998) and mixture model (Rubin, 1977; Wu and Bailey, 1989;

Little,1994,1993) frameworks have been proposed to adjust for potentially non-ignorable dropout. While all methods for dealing with missing data must make some unavoidable

(14)

assumptions about the behavior of unobserved outcomes, most methods rely on additional parametric assumptions about the distribution of dropout times and the relationship between the outcome and dropout time. Semiparametric methods can help relax these assumptions and make inference more robust.

In Chapter II, we present a Bayesian framework for a semi-parametric varying coef-ficient model for exponential family longitudinal data with non-ignorable dropout. The model allows coefficients to vary with dropout time through smooth functions, which are modeled with natural cubic B-splines. The full data is a mixture over the distribution of dropout times, which can be left unspecified. Our Bayesian framework utilizes a re-versible jump Markov chain Monte Carlo approach that does not require the choice of a single set of spline knots to make statistical inference, making inference less dependent on "nuisance parameters." These methods reduce bias and more accurately characterize model uncertainty than standard linear mixed models and the conditional linear model to account for dropout. To demonstrate these methods, we present results from a sim-ulation study and estimate the impact of hard drug use on longitudinal CD4+ T cell count and viral load suppression in a cohort of HIV infected subjects.

In Chapter III, we propose a Bayesian Dirichlet process mixture model as a semi-parametric alternative to commonly used semi-parametric frailty models to account for dropout. In our semi-parametric Dirichlet-process method, subject-specific effects and the natural log of a subject’s dropout time are assumed to come from a Dirichlet process mixture of multivariate normal distributions. If non-ignorable dropout is present, this model is able to cluster subjects with similar subject-specific effects and dropout times, al-lowing subject-specific effects to be flexibly associated with dropout time. There is no need to specify a parametric form for the relationship between dropout time and the subject-specific effects, as in the conditional linear model (Wu and Bailey, 1989), or to a priori group subjects with subjectively similar patterns of missing data together, as in a pattern mixture model (Pauler et al.,2003). The distribution of dropout times is also

(15)

non-parametrically modeled as a mixture of log-normal distributions. Results from sim-ulation studies as well as an application to a publicly available longitudinal HIV cohort study database are presented.

In Chapter IV, two methods to account for dropout while jointly modeling two contin-uous longitudinal outcomes are presented. Longitudinal CD4+ _{T cell count and plasma}

HIV RNA are correlated biomarkers commonly used in the management of patients with HIV disease (Carpenter et al., 1998). These outcomes are highly variable, even within individuals, due to measurement error and short-term variability, as well as long-term trends, making it difficult to tease apart these sources of variation in order to under-stand long-term changes in the markers over time (Hughes et al.,1994;Zeger and Diggle,

1994; Paxton et al., 1997; Raboud et al., 1996). In addition, CD4+ _{T cell count and}

viral load are often repeatedly measured at irregular intervals and cohort studies often have many missing observations. Bivariate models for CD4+ _{T cell count and viral load}

have been proposed in order to understand the correlation between these two inherently noisy outcomes, and it has been suggested that bivariate methods also improve model fit (Sy et al., 1997; Brown et al., 2001; Boscardin et al., 1998; Thiebaut et al., 2003, 2005,

2002). Thiebaut (Thiebaut et al., 2005) proposed a simple frailty model to account for dropout while jointly modeling CD4+ _{T cell count and viral load; however, this model}

relies heavily on parametric assumptions. In order to relax these assumptions, we pro-pose two semiparametric Bayesian joint models, based on extensions to the univariate models presented in Chapters II and III. Both of these models allow for more flexible relationships between the two outcomes, while relaxing assumptions about the dropout time distribution and the relationship between dropout time and the change in the out-comes over time. Results of a simulation study comparing these bivariate methods to their univariate counterparts and an application to the Women’s Interagency HIV Study are presented. In Chapter V, future directions of research are considered.

(16)

CHAPTER II

A BAYESIAN NATURAL CUBIC B SPLINE VARYING COEFFICIENT METHOD FOR NON-IGNORABLE DROPOUT

II.1 Summary

Dropout is a common problem in longitudinal clinical trials and cohort studies, and is of particular concern when dropout occurs for reasons that may be related to the outcome of interest. We present a Bayesian framework for a semi-parametric varying coefficient model for exponential family longitudinal data with non-ignorable dropout. The model allows coefficients to vary with dropout time through smooth functions, which are modeled with natural cubic B-splines. The full data is a mixture over the distribution of dropout times, which can be left unspecified. Our Bayesian framework utilizes a reversible jump Markov chain Monte Carlo approach that does not require the choice of a single set of spline knots to make statistical inference, making inference less dependent on "nuisance parameters." These methods reduce bias and more accurately characterize model uncertainty than standard linear mixed models and the conditional linear model to account for dropout. To demonstrate these methods, we present results from a simulation study and estimate the impact of hard drug use on longitudinal CD4+ T cell count and viral load suppression in a cohort of HIV infected subjects.

II.2 Introduction

HIV studies are often longitudinal in nature, with subjects followed for extended peri-ods of time to study disease progression in different subpopulations. Dropout is common in these studies, and it is well documented that many subjects have missing observations due to death or disease progression, leading to concerns of non-ignorable dropout (Lanoya et al., 2006). Dropout is not ignorable when missingness depends on the values of the unobserved outcomes, even after conditioning on the available data (Little and Rubin,

(17)

For example, this work was motivated by the challenges associated with comparing lab-oratory markers of HIV disease progression between hard drug users and other subjects in the Women’s Interagency HIV Study (WIHS). Since subjects tend to drop out due to declining health, results from traditional longitudinal analysis methods may be overly optimistic, as they are biased towards study completers, who tend to have favorable out-comes. In addition, since dropout and missing data are more common among drug users, differences between drug users and other subjects may also be underestimated. Similar dropout related challenges have been identified in quality of life data from clinical trials of cancer therapies (Fairclough et al., 1998), anti-depressant clinical trials (Molenberghs et al., 2004), and studies of smoking cessation programs (Hogan et al., 2004b), among others.

This paper develops a Bayesian framework for semi-parametric varying coefficient generalized linear mixed models (GLMM) to flexibly accommodate dropout in longitu-dinal studies with exponential family outcomes. While all methods that account for non-ignorable dropout rely on unavoidable assumptions that cannot be verified from the observed data (Daniels and Hogan, 2008), many methods make additional parametric assumptions about the distribution of dropout times or the functional form of the re-lationship between regression coefficients and dropout time. These assumptions can be relaxed using semi-parametric methods. We extend existing frequentist natural cubic B-spline varying coefficient methods to account for dropout in longitudinal studies with a Gaussian outcome (Forster et al., 2012; Moore et al., 2015) to other non-normal out-comes in the exponential family. Our method also extends these models into a Bayesian framework, allowing the number and location of spline knots to be jointly modeled with other model parameters, removing dependence on the choice of spline knots and more accurately characterizing model uncertainty. In addition, this method extends work by

Biller and Fahrmeir (2001) on knot selection in generalized linear models to hierarchical models.

(18)

While there is a large body of literature describing selection (Heckman, 1979;Diggle et al., 1994; Heckman, 1998), frailty (Follmann and Wu, 1995; Albert and Follmann,

2000;Wu and Bailey,1989; Schluchter,1992; Lancaster and Intrator,1998) and mixture models (Rubin, 1977; Wu and Bailey, 1989, 1988) to account for non-ignorable dropout in longitudinal studies with a Gaussian response, methods for non-normal data are less developed (Ibrahim and Molenberghs, 2009). The literature is particularly sparse for ad-dressing non-ignorable dropout in GLMMs in semi-parametric or Bayesian frameworks. In a frequentist setting, parametric selection (Ibrahim et al., 2001; Wu and Wu, 2007), frailty (Ten Have et al.,1998), and mixture models (Ekholm and Skinner,1998;Follmann and Wu, 1995; Fitzmaurice et al., 2001) have been proposed to account for dropout in studies with binary and other non-normal outcomes; however, semi-parametric methods are lacking. In a Bayesian framework, methods for binary outcomes have focused largely on marginalized transition models for population level rather than subject-specific in-ference (Su and Hogan, 2008, 2010). For subject-specific inference, Kaciroti et al have described Bayesian pattern mixture models (PMM) for binary and count data (Kaciroti et al., 2009, 2012), however, these methods may not be feasible for large numbers of dropout patterns or continuous dropout times.

Varying coefficient models (VCM) are mixture models that adjust for dropout by allowing regression coefficients to depend on dropout time. Mixture models (Wu and Carroll, 1988; Wu and Bailey, 1989; Mori et al., 1992; Hogan and Laird, 1997) account for dropout by factoring the joint distribution of the outcome, y, and dropout time, u_{, as f (y, u) = f (y|u)f(u), and the full-data response distribution f(y) is given by} R f(y|u)dF (u). In the VCM framework, if the regression coefficients are constant with respect to dropout time, the model reduces to a standard GLMM. Assuming regression coefficients are polynomial functions of dropout time results in Wu and Bailey’s condi-tional linear model (Daniels and Hogan, 2008); however, misspecification of the polyno-mial function can lead to biased estimates (Hogan et al., 2004a; Forster et al., 2012).

(19)

Semi-parametric varying coefficient models can flexibly account for dropout, requiring only that regression coefficients are smooth, continuous functions of dropout time, mak-ing them more robust (Hogan et al., 2004a;Forster et al.,2012). We propose a Bayesian natural cubic B-spline varying coefficient GLMM (BNSV) that can account for dropout in longitudinal studies with exponential family outcomes while avoiding assumptions about the distribution of dropout times or the functional form of the relationship between re-gression coefficients and dropout time. In addition, unlike PMMs, dropout can occur at any continuous point in time. In the BNSV, the smooth functions are modeled using natural cubic B splines (Forster et al., 2012). Similar models have been proposed for Gaussian outcomes using penalized splines (Su and Hogan, 2010).

Fitting this semi-parametric model in a Bayesian framework has several advantages. The number and location of spline knots control the smoothness, shape, and flexibility of the spline over the range of dropout times; however, fitting in a frequentist framework, it is unclear how to choose these parameters (Biller, 2000; Biller and Fahrmeir, 2001;

Eubank, 1999; Denison et al., 1998; Friedman and Silverman, 1989). Determining the “right” number and locations of knots by visual inspection of the data is impossible in most cases (Eubank, 1999); but despite evidence for the influence of knot locations on model fit, in practice it is common to simply place knots at even quantiles of the dropout distribution (Forster et al., 2012). Our Bayesian framework utilizes a reversible jump Markov chain Monte Carlo (RJMCMC) approach that jointly models the number and location of knots for the spline and does not require the choice of a single set of spline knots to make statistical inference. In addition, there is no need to specify a parametric distribution for the dropout times or to use an extra bootstrap simulation to estimate standard errors, as is required in the frequentist, semi-parametric approach.

We present results from simulation studies showing the BNSV reduces bias and mean squared error compared to standard linear mixed models (LMM) and GLMMs when non-ignorable dropout is present. We further apply the BNSV to the estimation of the impact

(20)

of hard drug use on longitudinal CD4+ _{T cell count using data from untreated subjects}

enrolled in the Women’s Interagency HIV Study (WIHS). In a second application, we examine the impact of recreational drug use on the odds of viral load suppression over time for HIV infected subjects that have initiated highly active antiretroviral therapy (HAART) in the WIHS.

II.3 Bayesian Varying Coefficient Models for Non-Ignorable Dropout

Let y = (y1. . . ym)′ be the set of outcomes observed on m subjects with ni

observa-tions each at times t = (t1. . . tm)′. Let u = (u1. . . um)′ be the set of m observed dropout

times, Θ be the set of model parameters associated with the longitudinal outcome process, and π be the set of model parameters associated with the dropout process. Assuming that Θ and π are independent, the distribution p(Θ, π|y, u) ∝ p(y, u|Θ, π)p(Θ, π) ∝ p(y|u, Θ)p(u|π)p(Θ)p(π) (Su and Hogan,2010). Therefore π is not part of the posterior for Θ, and inference on Θ depends only on the conditional likelihood, p(y|u, Θ). Infer-ence on π depends only on the marginal likelihood for the dropout time, p(u|π). Since the posterior distribution can be factored in this way, estimation of Θ can be performed without making distributional assumptions about u or π. First we describe the condi-tional model for y|u, which allows the change in the outcome over time to depend on dropout time and results in dropout time specific estimates. While inference conditional on u can be made without these assumptions about u and π, it is often of interest to summarize the results with a marginal or “dropout adjusted" estimate of the outcome that does not depend on dropout time, which requires integrating over the distribution of dropout times. We utilize Rubin’s Bayesian bootstrap method (Rubin, 1981) to flex-ibly model the distribution of dropout times and to calculate marginal estimates in a straightforward manner.

(21)

II.3.1 Model for Exponential Family Outcomes with Dropout-Varying Coeffi-cients

For exponential family outcomes, the observation specific conditional VCM is:

f (yij|ui, αi, ηij) = exp [(yijηij − b(ηij))/φ] c(yij, φ) (II.1)

µij = E[yij|ηij] = b′(ηij) (II.2)

g(µij) = ηij = β0+ β1(ui)tij + CijβC + Zijαi (II.3)

where g() is the link function, ηij is the linear predictor, Zij is the design matrix associated

with the random effects, αi, and φ is a scale parameter. For a model with a random

intercept and slope, let αi = α_α0i 1i ∼ N0₀ , σ20 σ01 σ01 σ12

. β0 is the intercept, and

β1(ui) is the dropout-varying slope. Cij is the design matrix associated with the covariate

effects, βC, which do not depend on dropout time. For example, for a binary categorical

outcome, the VCM is:

yij|µij ∼ Bernoulli(µij) (II.4)

µij = exp(ηij)/(1 + exp(ηij)) (II.5)

logit(µij) = ηij = β0+ β1(ui)tij + CijβC+ Zijαi (II.6)

For a normally distributed outcome, yij|ηij ∼ N(ηij, σǫ2) and ηij = µij = β0 +

tijβ1(ui) + CiβC + Ziαi. This model can easily be extended to include a dropout

time-varying intercept in addition to a dropout-time-varying slope.

II.3.2 Natural Cubic B-Splines

In the BNSV, the slope, β1(ui), in Equation II.3 is assumed to be a smooth function

of dropout time and is modeled using natural cubic B-splines (de Boor, 2001). The ith

(22)

subject’s dropout-varying slope is:

β1(ui) = D+1

X

k=1

θkB(u, D, l)˜ [i,k] (II.7)

Here D is the number of degrees of freedom for the dropout-varying component of the slope and ˜B(u, D, l) is the matrix of natural cubic B-spline basis functions evaluated at u with D + 1 knots (including 2 boundary knots) at locations l = {l1, ..., lD+1}, for D ≥ 1.

For D = 0, there is no dropout-varying effect and ˜B(u, D, l)[i,1] = 1 for all subjects.

θ = (θ1. . . θD+1) are the coefficients associated with the basis functions.

II.3.3 Dropout Time Model and Bayesian Bootstrapping

Rubin’s Bayesian bootstrap (Rubin,1981) is used to flexibly model the distribution of dropout times (Su and Hogan, 2010). Like frequentist bootstrap methods, this assumes that dropout times follow a multinomial distribution and that all possible dropout times have been observed in the data. This may be reasonable if subjects are seen at distinct numbered visits, but may be less so if dropout times are measured in continuous time. The Bayesian bootstrap repeatedly samples the proportion of subjects dropping out at each of the observed dropout times, rather than re-sampling the observed dropout times themselves, as would be done in a frequentist bootstrap. Define u0 _{= (u}0

1, ..., u0R) as

the R unique ordered observed dropout times. Let π = (π1, . . . , πR) be the vector of

probabilities of dropping out at each observed dropout time and N = (N1, . . . , NR) be the

number of subjects observed dropping out at each unique dropout time. The likelihood is proportional to QR

r=1πrNr. If we assume the prior distribution of π is proportional

to QR

r=1π−1r , the posterior distribution of π is proportional to

QR

r=1πrNr−1, which is the

kernel of a Dirichlet distribution. The posterior distribution of π is then Dirichlet with concentration parameters (N1, . . . , NR).

(23)

II.3.4 The Calculation of Marginal Effects

Working on the linear predictor scale, it is possible to calculate a marginal slope, averaged over both the distribution of dropout times and random effects. The expected value of the linear predictor at time t is:

E(ηij|t, C) =

Z Z

[β0+ β1(u)t + CβC+ Zα] dF (α)dF (u|C) (II.8)

= Z

[β0+ β1(u)t + CβC] dF (u|C) (II.9)

= β0+ CβC + t

Z

β1(u)dF (u|C) (II.10)

E(ηij|t, C) is also a linear function of time with slope β1′ = E[β1(u)|C]. If we assume

the distribution of dropout times does not depend on the covariates (F (u|C) = F (u)), then β′

1 = E[β1(u)], and the marginal slope can be estimated at each iteration of the

RJMCMC algorithm in a straightforward manner. At iteration s, β₁′(s)= πT (s)_β

1(u0)(s).

If the assumption that the distribution of dropout times does not depend on the covariates is inappropriate, it may not always be possible to easily estimate marginal slopes, particularly in more complex cases where the distribution of dropout times may depend on continuous covariates or several different covariates. However in simple cases, for example comparing the change in the outcome over time between treatment groups, marginal effects can be easily calculated, even if the distribution of dropout times depends on group. Here the group specific marginal dropout-varying slopes conditional on the observed dropout times in each group can be estimated as above, by essentially taking a weighted average of the dropout-varying coefficients over the unique dropout times observed in each group (Moore et al., 2015).

II.3.5 Priors

D is assumed to have a Poisson(λ) prior distribution. In addition to the Poisson, others have proposed negative binomial and discrete uniform priors for the dimension of

(24)

models in fitting other types of varying coefficient models, but found that using either of these two distributions instead of the Poisson resulted in mixing problems and limited exploration of models with larger numbers of knots (Biller,2000).

For knot locations, we assume a discrete set of M candidates, such as the order statistics of the observed drop out times. For a given D, all sets of knots are assumed to have the same prior probability, so that:

p(l1...lD+1|D) = M D + 1 −1 = (D + 1)!(M − D − 1)! M !

This method could easily be extended to a continuous range of candidate knot locations distributed as the order statistics from a continuous uniform distribution over the range of observed dropout times.

The fixed effect coefficients for the natural B spline basis functions are assumed to have a multivariate normal prior with mean zero, and independent covariance structure:

θ _{∼ MV N}D+1(0, R0)

where R0 = σβ2ID+1 x D+1. I(D+1)x(D+1) is a (D +1) x (D +1) identity matrix. In practice,

σ2

β is chosen to be large enough to be “non-informative.”

We assume (β0, βC)′ have a multivariate normal prior, M V N (0, σβ2I). Differing σ2β

may be chosen for each coefficient or block of coefficients. In addition, we assume an inverse Wishart prior for the covariance of the random effects, and for a normally dis-tributed outcome, an inverse gamma prior for the variance of the residual error, such that Σα= σ 2 0 σ01 σ11 σ21 ∼ IW (ν0, S) and σ2ǫ ∼ IG(a0, b0).

(25)

II.4 Estimation

II.4.1 Reversible Jump Markov Chain Monte Carlo

A reversible jump Markov chain Monte Carlo algorithm (Green, 1995) is used to fit the BNSV model. Given the current set of values at iteration (s) of the RJMCMC algorithm, a set of values for the (s + 1) iteration is simulated as follows:

1. Update the dimension of the model, D(s)_{, using a Metropolis-Hastings step.}

(a) Proposals of births and deaths occur randomly with probabilities b and d, respectively.

(b) In a birth step, a knot is randomly proposed for addition to the model. Since the dimension of the model increases, an additional coefficient is also drawn from a normal distribution centered around its estimate from a single iteration of the iterative weighted least squares (WLS) algorithm.

(c) In a death step, a knot and a coefficient are randomly proposed for deletion from the model.

(d) Since adding or deleting a knot results in changes in the B-spline basis func-tions, the fixed effects coefficients associated with the dropout-varying slope are adjusted based on their estimates from a single iteration of the iterative WLS, or using least squares estimates if the outcome is normally distributed. Details of this step are available in Appendix A.1.1.

2. Conditioned on the current number of knots, update the other parameters in the model using a Metropolis-Hastings within Gibbs sampler.

(a) Update knot locations using a Metropolis-Hastings step.

(b) Update fixed effects using a Metropolis-Hastings step with Gammerman’s WLS proposal (Gammerman, 1997).

(26)

(c) Sample or update random effects and variance components using Gibbs or Metropolis-Hastings steps depending on the distribution of the outcome and if appropriate conjugate priors are used.

3. Update π(s) _{in a Gibbs step and calculate the marginal slope or other marginal}

quantities of interest.

The details of each step can be found in Appendix A.1.

II.4.2 Implementation Issues

In the dimension change step, the B-spline basis functions must be recalculated and a coefficient must be added or deleted. Formulas exist for updating coefficients when knots are added or deleted from a spline (Lyche and Stom,1996); however, using deterministic rules, the required symmetry of birth and death steps in RJMCMC is destroyed (Biller,

2000), so a different method of updating coefficients is needed. In a birth step, others have proposed adding an additional coefficient by taking a weighted average of neighbor-ing coefficients based on a random uniform number and adjustneighbor-ing the other coefficients accordingly (Biller, 2000). Applying this method to the BNSV to account for dropout results in low acceptance ratios, unless a large mean is used for the prior for the number of knots, as the proposed coefficients often fit the data poorly. Similarly, proposing a new coefficient from the prior distribution can also result in low acceptance of birth and death moves. Our approach to adding a coefficient described in Appendix A.1.1 is based on the WLS estimates of the coefficients when l(s), the current set of knot locations, and l∗, the proposed set of knot locations, are used for the natural cubic B spline transformation of dropout times and results in better mixing and higher acceptance rates (≈1-2%) in our dropout applications.

The method is not sensitive to starting values for the number or location of spline knots or for the coefficients, as Gammeran’s WLS proposal typically has high acceptance

(27)

the number of knots included in the model, as well as how often births and deaths of knots are accepted. In addition to the prior mean for D, the values of b and d, σ2

β and

the variance of the proposal distribution when a new coefficient is added to the model all influence the acceptance probability of birth and death steps. In particular, choosing σ2

β to be very large in order to be “uninformative" can result in low acceptance of birth

moves and models with few or no knots; on the other hand, constricting the prior for the coefficients to a small range around 0 can result in the addition of many small coefficients into the model. For our problems, coefficients are typically small (absolute value less than 4) and σ2

β of 25 to 100 appear strike a good balance between these two extremes. In our

application, we favor smoothness and parsimony and typically choose a relatively small mean of 5 for D. We use b, d, and the variance of the proposal distribution when a new coefficient is added to the model in order to tune the acceptance probability of birth and death steps. We assess the convergence the MCMC chain by trace plots of the dropout time specific and marginal slope, as well as for the variance, intercept and covariate effects that do not involve varying numbers of spline components.

II.5 Simulation Study

II.5.1 Models

The BNSV method was used to fit models for four different simulated dropout sce-narios, including two normally distributed outcomes and two binary outcomes. Two different forms of the dropout mechanism were considered: (i) a continuous and smooth function meeting assumptions of the BNSV and (ii) a discontinuous step function. In ad-dition, simulations for linear dropout-varying slopes and no dropout effect are presented in Appendix A.2 to show the BNSV can also fit CLMs and GLMMs.

More specifically, the following form for the data was assumed: ηij = β0+ β1(ui)tij+

α0i+ α1itij, i = 1...m, j = 1...ni for m subjects with ni observations for the ith subject,

where (α0i, α1i)′ ∼ N(0, Σα). For the Gaussian simulations, yij|ηij ∼ N(ηij, σepsilon2),

(28)

and β0 = 0. Dropout times were u = U/15 ∈ [0, 1], resulting in 16 time points

spaced equally from 0 to 1. Uniform dropout was created from a beta-binomial where p ∼ Beta(1.5, 1.5) and U ∼ Bin(15, p). The within-subject variance, σepsilon2, was

set at 0.067. The elements of Σα were as follows: σ02 = 0.4, σ12 = 0.01 and σ01 =

−0.01 (Hogan et al., 2004a; Forster et al., 2012). For the binary simulations, yij|ηij ∼

Bernoulli(logit−1(ηij)), β0 = −3, and for stability, dropout began at the third

observa-tion. The elements of Σα were as follows: σ20 = 0.4, σ12 = 0.1 and σ01 = −0.01. The

forms of the dropout-varying slope were: Normal (i): β1(u) = −3 exp(−4u), Normal (ii):

β1(u) = I(u>2/3), Binary (i): β1(u) = 10[1−2 exp(−4u)], Binary (ii): β1(u) = 4+61(u>2/3)

(Figure II.1). The dropout effects in these scenarios were similar to those seen in the WIHS and other typical HIV cohort studies. For each simulation scenario, 1,000 datasets with 400 subjects each were created.

II.5.2 Methods of Evaluation

BNSV models were fit to each dataset, as well as a standard GLMM that did not account for dropout. The performance of the methods was evaluated graphically and in terms of bias, variance, and mean square error for the marginal slope, estimated by the posterior mean. All analyses were implemented in R using a custom RJMCMC algorithm utilizing the splines, MASS, mvtnorm, MCMCpack, and gtools packages.

II.5.3 Implementation

For the BNSV, a maximum of 10 degrees of freedom were considered for the dropout-varying component of the slope, for a maximum of 11 total degrees of freedom for the slope. The intercept did not vary with dropout time. The prior mean for the number of degrees of freedom for the dropout-varying component of the slope was set to 5, and the prior variance for the coefficients was set to 25 for the normal simulations, and 100 for the binary simulations. The prior for Σα was IW (3, I) and the prior for _σ12

ǫ was IG(0.001, 0.001). The probability of proposing a birth/dimension increase was 0.2. The

(29)

MCMC chain was initiated with five equally spaced knots (including the 2 boundary knots) and coefficients set to their least squares or WLS estimates. Random effects were started at 0. Chains were run for 40,000 iterations with a burn in of 10,000 without thinning.

II.5.4 Results

The BNSV method was able to fit the dropout-varying slope accurately for forms that met as well as forms that violated model assumptions (Figure II.1). The performance of the method was quantified in terms of bias, variance, and mean squared error (MSE) for the marginal slope (Table II.1). While the GLMM that did not account for dropout had lower variance due to the increased flexibility and additional parameters included in the BNSV model, the BNSV had lower bias and MSE for the marginal slope when non-ignorable dropout was present.

II.6 Applications to the WIHS

We applied the BNSV method to investigate the impact of drug use on longitudinal HIV outcomes in the WIHS. The WIHS is an ongoing prospective study of the natural and treated histories of HIV infection in women, with behavioral data and specimens collected at semiannual visits by multiple sites (Barkan et al., 1998). The WIHS has enrolled 3,766 (2,791 HIV+ and 715 with AIDS at baseline) women since 1994. At the beginning of the HIV epidemic, research focused on male populations at high risk for AIDS - predominantly white, urban, homosexual and bisexual men. WIHS was estab-lished in response to dramatic increases in new AIDS cases among women in the 1990’s (Bacon et al.,2005). In contrast to the male populations, HIV and AIDS are more preva-lent among women of color exposed through heterosexual partners or intravenous drug use (IDU) (Bacon et al., 2005; Centers for Disease Control and Prevention, 2002). For untreated subjects, we hypothesized that hard drug users would have steeper declines in CD4+ _{T cell count compared to other untreated subjects in the cohort. For HIV+}

(30)

subjects that have initiated highly active antiretroviral therapy (HAART), the primary measure of treatment effectiveness is suppression of viral load (HIV-1 RNA below detec-tion limits). We hypothesized that recreadetec-tional drug users would have slower increases in the odds of viral load suppression compared to other subjects in the cohort.

II.6.1 Methods

We utilized the BNSV method to compare longitudinal changes in CD4+ _count

be-tween consistent hard drug users and other untreated subjects and to compare viral load suppression between consistent recreational drug users and other treated subjects in the WIHS while accounting for dropout. Subjects were classified as consistent hard drug users if they reported injection drug use (IDU) or non-IDU cocaine, opiate or amphetamine use at 50% or more of visits combined with use within the last year before dropout. Subjects were classified as consistent recreational drug users if they reported marijuana or injection drug use (IDU) or non-IDU cocaine, opiate, amphetamine, or other drug use at 50% or more of visits combined with use within the last year before dropout. Dropout time was calculated as the day of the last visit + 1. Descriptive statistics are presented in Table II.2.

Ln(CD4+_{) was modeled for untreated subjects from the initial WIHS cohort (first}

recruitment period) for the first 5 years of the study, beyond which many of the subjects had missing data. If a subject remained on study for longer than 5 years a dropout time of 5 years + 1 day (1826 days) was assigned. In addition to hard drug use, baseline ln(CD4+_{) and its interaction with time were included as covariates in the model. Viral}

load suppression was modeled for subjects that initiated treatment between 1995 and 2000 for all visits up to 11 years after initial treatment initiation, when many subjects no longer had available data. Again, if a subject remained on study for longer than 11 years a dropout time of 11 years + 1 day was assigned. Since detection limits of viral load assays changed over time, viral loads under 400 copies/mL were considered “undetectable".

(31)

Baseline ln(CD4+_{) and log}

10(viral load) (measurements preceeding treatment initiation)

and their interactions with time were included as covariates in the model.

II.6.1.1 Implementation

Different dropout-varying slopes were allowed for drug users and other subjects. The RJMCMC chains were run for 200,000 iterations, with a burn in of 50000 iterations. A Poisson prior with a mean of 5 was used for the number of knots in the model. Spline coefficients and covariates were updated in separate blocks. Normal distributions with mean 0 and variance of 100 were used as priors for the coefficients to be “non-informative.” Slopes on the linear predictor scale, averaged over dropout time, were calculated using the Bayesian bootstrap method described in Sections 2.3-4. The distribution of dropout times was not assumed to depend on the covariates. For comparison, GLMMs that did not account for dropout were fit to the data using a similar MCMC estimation algorithm.

II.6.2 Results

II.6.2.1 Longitudinal CD4+ _Count

Consistent hard drug users tended to dropout of the study earlier and were more likely to dropout due to death (Table II.2). Analyses accounting for dropout show that overall, hard drug users had more rapid declines in CD4+ _{count than those who did not}

use hard drugs (Figure II.2). Assuming a baseline CD4+ _{count of 478.5 (median), hard}

drug users CD4+ _{counts declined by 33.5% per year (95% CI: 25.0-41.2) compared to}

17.8% (95% CI: 14.9-20.7) for others in the WIHS (Table II.3). Comparing these results to a linear mixed-effects model, declines in CD4+ _{were steeper and the magnitude of the}

difference between hard drug users and non-users was larger in the BNSV model (Figure II.2). Using a linear mixed model, hard drug users were found to have 22.4% declines in CD4+ _{count per year (95% CI: 17.8-26.8) compared to 14.6% (95% CI: 12.1-17.0) in}

others.

(32)

II.6.2.2 Longitudinal Viral Load Suppression

Consistent recreational drug users also tended to dropout of the study earlier and were more likely to dropout due to death than other subjects that initiated HAART (Table II.2). The average change in the log odds of viral load suppression per year assuming a median baseline CD4+ _{count of 267 and log}

10(viral load) of 4.2 are presented in Table

II.3. The probability of suppression for a subject with the average slope, baseline CD4+

count of 267 and log(viral load) of 4.2 are shown in Figure II.3. For both recreational drug users and other subjects that initiated HAART, the estimated probability of viral load suppression as well as the change in the odds of suppression over time were reduced using the BNSV method to account for dropout compared to a standard GLMM; however, estimated differences in the change in odds of suppression over time between drug users and others are similar for the two models. For a recreational drug user with baseline CD4+ _{count of 267, log(viral load) of 4.2, and random effects of 0, the odds of viral load}

suppression increased by 1.07 times per year (95% CI: 0.93 to 1.21), compared to 1.12 times per year (95% CI: 1.06 to 1.19) for a subject with the same covariates that did not use recreational drugs. Using a standard GLMM that did not account for dropout, these estimates were 1.23 (95% CI: 1.15 to 1.32) and 1.29 (95% CI: 1.24 to 1.34) respectively. While recreational drug users had smaller increases in the odds of suppression per year, this difference was not statistically significant (Table II.3).

II.6.3 Sensitivity Analyses

The results of these analyses rely on the assumption that subjects continue on the same linear trajectory after their dropout. We test the sensitivity of our results to this assumption by considering a proportional attenuation of the slope after a subject’s drop out, such that after dropping out, a subject’s slope becomes δβ1(ui) (Appendix A.3).

For CD4+ _{declines, while the estimates of the differences between drug users and others}

(33)

CD4+ _{counts at years 1 to 4 than other untreated subjects. For viral load suppression,}

the odds of suppression remain lower for consistent recreational drug users compared to other subjects that initiated HAART for δ = 0, 0.25, 0.5, 0.75, however, as in the primary analysis, differences between drug users and others were not statistically significant.

II.7 Discussion

We propose a flexible, semi-parametric natural cubic B-spline varying coefficient method to account for dropout in a Bayesian framework. The BNSV extends exist-ing frequentist natural cubic B-spline varyexist-ing coefficient methods to account for dropout in longitudinal studies with a Gaussian outcome (Forster et al.,2012;Moore et al.,2015) to other non-normal outcomes in the exponential family, while also allowing the number and location of spline knots to be jointly modeled with other model parameters, remov-ing dependence on the choice of spline knots and more accurately characterizremov-ing model uncertainty. The BNSV allows for dropout occurring at any continuous point in time and avoids making parametric assumptions about the distribution of dropout times or the functional form of dropout-varying slope. As the method jointly models the number and location of splines knots for the dropout-varying slope and marginalizes results over the number and location of knots, results are more robust and less dependent on user choices. Results of our simulation studies show that the BNSV reduces bias and mean squared error for the marginal slope compared to standard GLMMs when non-ignorable dropout is present. The BNSV can also accurately fit models with a linear dropout-varying effect or no dropout-varying effect.

Applying the BNSV to the analysis of viral load suppression in treated subjects in the WIHS, we find lower odds of viral load suppression when accounting for dropout compared to using a standard GLMM. In the analysis of untreated CD4+ _{count in the}

WIHS, after accounting for dropout, we find important differences in estimates of CD4+

decline for both hard drug users and others compared to a standard LMM. In addition,

(34)

since differential dropout is present, we also find a larger difference in CD4+ _declines

between hard drug users and others.

Table II.1: Simulation Study Comparing BNSV and GLMM Methods: Posterior Mean Marginal Slope Estimates, Bias, Variance and Mean Squared Error (MSE)

BNSV GLMM

Distribution Form Slope Bias Variance MSE Slope Bias Variance MSE

Normal (i) -0.64 0.06 0.01 0.02 -0.21 0.49 0.001 0.24 (ii) 0.35 -0.002 0.01 0.01 0.65 0.31 0.002 0.10 Binary (i) 7.49 0.13 0.19 0.20 9.09 1.74 0.09 3.11 (ii) 6.67 -0.05 0.17 0.18 8.56 1.85 0.14 3.55 0.0 0.2 0.4 0.6 0.8 1.0 −3.0 −2.0 −1.0 0.0

(i) Normal − Slope

Dropout Time Slope 0.2 0.4 0.6 0.8 1.0 2 4 6 8 10

(i) Binary − Slope

Dropout Time Slope 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

(ii) Normal − Slope

Dropout Time Slope 0.2 0.4 0.6 0.8 1.0 4 5 6 7 8 9 10

(ii) Binary − Slope

Dropout Time

Slope

BNSV Truth

Figure II.1: Simulation Study of the BNSV Method: Posterior Mean Estimates of the Dropout-Varying Slope

(35)

Table II.2: Demographic Characteristics of Untreated Subjects with HIV Disease in the WIHS by Drug Use Group Untreated

Other (N=566) Drug User (N=248)

Mean (SD) Median (IQR) Mean (SD) Median (IQR)

or N or Percent or N or Percent

Age (years) 35.6 (8.0) 35.1 (30.2-40.4) 38.0 (7.1) 37.8 (33.0-42.5)

Baseline ln(CD4+₎ _{6.0 (1.1)} _{6.2 (5.8-6.5)} _{5.9 (1.0)} _{6.0 (5.6-6.4)}

Baseline log10(Viral Load) 3.9 (1.1) 4.0 (3.1-4.7) 4.3 (0.9) 4.3 (3.8-4.9)

Dropout Time (days) 959.4 (613.4) 713.5 (394.2-1818.0) 670.8 (584.8) 404.5 (208.8-882.5)

Minority 476 84.1 201 81.0

Dropout Reason: Died 24 4.2 28 11.3

Lost to Follow Up 162 28.6 44 17.7

Started Treatment 380 67.1 176 71.0

HAART

Other (N=785) Drug User (N=230)

Mean (SD) Median (IQR) Mean (SD) Median (IQR)

or N or Percent or N or Percent

Age (years) 39.1(8.0) 38.7 (33.7-43.6) 40.3 (7.3) 40.5(34.9-45.7)

Baseline ln(CD4+₎ _{5.3 (1.2)} _{5.6 (5.0-6.1)} _{5.3 (1.2)} _{5.5 (4.8-6.0)}

Baseline log10(Viral Load) 4.0 (1.1) 4.1 (3.3-4.9) 4.1 (1.2) 4.3 (3.4-5.0)

Dropout Time (days) 3096 (1346.6) 4015.0 (2136.0-4015.0) 2594 (1543.2) 3485 (1112.0-4015.0)

Minority 662 84.3 175 76.1

Dropout Reason: Died 150 24.2 80 34.8

Lost to Follow Up 595 75.8 190 65.2

Table II.3: Estimated Changes in ln(CD4+) and ln(Odds of Viral Load Suppression) per Year for Untreated Subjects in the

WIHS Using the BNSV and GLMM Methods. Changes in ln(CD4+_{) assume a baseline CD4}+ _{of 478.5. Changes in ln(odds)}

assume baseline CD4+ _{of 267 and baseline log}

10(viral load)=4.2 .PM=posterior mean, PP=posterior probability of a difference

< 0/lower slopes among drug users.

GLMM BNSV

PM 95% Credible Interval PP PM 95% Credible Interval PP

A) Untreated: ∆ ln(CD4+)

Hard Drug Users -0.254 -0.313 -0.196 -0.408 -0.531 -0.288

Others -0.158 -0.187 -0.129 -0.196 -0.231 -0.161

Difference -0.096 -0.160 -0.0312 0.998 -0.212 -0.338 -0.089 0.9996

B) HAART:

∆ ln(Odds of Suppression)

Recreational Drug Users 0.206 0.137 0.275 0.071 -0.075 0.190

Others 0.255 0.219 0.292 0.116 0.060 0.170

Difference -0.049 -0.125 0.026 0.896 -0.045 -0.200 0.086 0.752

(36)

0 1 2 3 4 5 100 200 300 400 500

A) Accounting for Dropout

Time (Years)

CD4 Count

Hard Drug Users 95% CI 0 1 2 3 4 5 100 200 300 400 500

B) Not Accounting for Dropout

Time (Years)

CD4 Count

Others 95% CI

Figure II.2: A) BNSV and B) LMM Posterior Mean CD4+ _{Count and 95% Credible}

Interval (CI) over Time in Untreated Subjects in the WIHS, Assuming a Baseline CD4+

(37)

0 2 4 6 8 10 0.0 0.2 0.4 0.6 0.8 1.0

A) Accounting for Dropout

Time (years)

Probability of Suppression

Recreational Drug Users 95% CI 0 2 4 6 8 10 0.0 0.2 0.4 0.6 0.8 1.0

B) Not Accounting for Dropout

Time (years)

Probability of Suppression

Others 95% CI

Figure II.3: A) BNSV and B) GLMM Posterior Mean Probability of Suppression and 95% Credible Interval (CI) over Time in a Subject that Initiated HAART in the WIHS, Assuming a Baseline CD4+ _{count of 267, Baseline log}

10(viral load)=4.2, and Random

Effects = 0.

(38)

CHAPTER III

A DIRICHLET PROCESS MIXTURE MODEL FOR NON-IGNORABLE DROPOUT

III.1 Summary

Longitudinal cohorts are a valuable resource for studying HIV disease progression, however dropout is common in these studies, with subjects often failing to return for vis-its due to disease progression, loss to follow-up, or death. When dropout depends on un-observed outcomes, data is missing not at random and results from standard longitudinal data analyses can be biased towards subjects that remain on study, who likely have more favorable outcomes. Several methods have been proposed to adjust for non-ignorable dropout; however, many of these approaches rely on parametric assumptions about the distribution of dropout times and the relationship between the outcome and dropout time. We propose a Bayesian semi-parametric Dirichlet process mixture model to relax these assumptions and provide more accurate inference when parametric assumptions are violated by non-parametrically modeling the distribution of subject-specific effects as well as the distribution of dropout times. Results from simulation studies as well as an application to a publicly available longitudinal HIV cohort study database are presented.

III.2 Introduction

Longitudinal studies are critical to understanding disease progression over time; how-ever, obtaining complete data on all subjects can be challenging. This is particularly true in longitudinal HIV cohorts, and it is well documented that many subjects in these stud-ies have missing observations due to death or disease progression, leading to concerns of non-ignorable dropout (Lanoya et al.,2006). Dropout is not ignorable when missingness depends on the values of the unobserved outcomes, even after conditioning on the avail-able data (Little and Rubin,2002). In this scenario, standard longitudinal data analyses

(39)

ciated with comparing laboratory markers of HIV disease progression between hard drug users and other subjects in the Multicenter AIDS Cohort Study (MACS). Since subjects tend to drop out due to declining health, results from traditional longitudinal analysis methods may be overly optimistic, as they are biased towards study completers, who tend to have more favorable outcomes. In addition, since dropout and missing data are more common among drug users, differences between drug users and other subjects may also be underestimated. In addition to HIV cohort studies, similar dropout related chal-lenges have been identified in quality of life data from clinical trials of cancer therapies (Fairclough et al., 1998), anti-depressant clinical trials (Molenberghs et al., 2004), and studies of smoking cessation programs (Hogan et al.,2004b), among others. Having flex-ible approaches with minimal assumptions about the patterns or distribution of dropout is useful when considering how missing data may influence the results of longitudinal data analyses.

Selection (Heckman, 1979; Diggle et al., 1994; Heckman, 1998), frailty (Follmann and Wu, 1995; DeGruttola and Tu, 1994; Albert and Follmann, 2000; Schluchter, 1992;

Lancaster and Intrator, 1998) and mixture model (Rubin, 1977; Wu and Bailey, 1989;

Little,1994,1993) frameworks have been proposed to adjust for potentially non-ignorable dropout. While all methods for dealing with missing data must make some unavoidable assumptions about the behavior of unobserved outcomes, most methods rely on additional parametric assumptions. For example, in frailty model frameworks, it is usual to make a parametric or proportional hazards assumption about the distribution of dropout times. Less obviously, this also typically implies an assumption about the functional form of the relationship between the outcome and the distribution or hazard of dropout (Daniels and Hogan, 2008).

In order to relax these assumptions, we propose a semi-parametric Dirichlet-process mixture model for non-ignorable dropout in longitudinal studies with exponential family outcomes (DP-Drop). The DP-Drop method models the data as a mixture of simple

(40)

frailty models. We utilize the frailty model proposed by Schluchter et al (Schluchter et al.,

2001), in which the subject-specific random effects (intercept and slope) and the natural log of dropout time have a multivariate normal distribution. In addition to assuming a log-normal distribution of dropout times, this simple frailty model also assumes a linear relationship between the log of dropout time and the dropout time specific intercepts and slopes. This defines a very specific relationship and the model may not be accurate if this assumption is violated. Using a Dirichlet process mixture of these simple models, rather than a single model, relaxes these assumptions and allows the distribution of dropout times as well as the relationship between the outcome and dropout time to be more flexibly modeled.

Dirichlet process mixture models are commonly used in semi- and non-parametric Bayesian approaches to model distributions as a mixtures of unknown and potentially infinite numbers of normal distributions (Ghosal, 2010; Christensen et al., 2011). In our semi-parametric Dirichlet-process method, subject-specific effects and the natural log of subject’s dropout time are assumed to come from a Dirichlet process mixture of multivariate normal distributions. If non-ignorable dropout is present, this model is able to cluster subjects with similar subject-specific effects and dropout times. This method allows for a nonparametric distribution of subject-specific effects and for the subject-specific effects to be flexibly associated with dropout time. In addition, there is no need to specify a parametric form for the relationship between dropout time and the subject-specific effects, as in the conditional linear model (Wu and Bailey, 1989), or to a priori group subjects with subjectively similar patterns of missing data together, as in a pattern mixture model (Pauler et al.,2003). The distribution of dropout times is also non-parametrically modeled as a mixture of log-normal distributions.

We present results of a simulation study which show that the DP-Drop method reduces bias and mean squared error for the marginal slope compared to 1) standard generalized linear mixed models (GLMM), 2) Dirichlet process mixture models that allow a

(41)

non-parametric random effects distribution, but do not consider dropout time, and 3) simple frailty models when non-ignorable dropout is present. We demonstrate the benefits of the DP-Drop method by examining the impact of hard drug use on CD4+ _{T cell decline in}

untreated HIV infected subjects enrolled in the Multicenter AIDS Cohort Study (MACS).

III.3 Methods

Several authors have described the use of Dirichlet process priors to model random ef-fects distributions non-parametrically in hierarchical and generalized linear mixed models (Kleinman and Ibrahim, 1998b,a; Bush and MacEachern, 1996). The Dirichlet process (Ferguson, 1974, 1973) is a distribution over probability measures, characterized by a concentration parameter, α > 0, which describes the prior precision or clustering of the distribution, and a baseline distribution, P0. The Dirichlet process draws distributions

from around P0, the expected value of the Dirichlet process; however, these draws are

almost surely discrete, even if P0 itself is continuous. Placing a Dirichlet process prior

on a random effects distribution P allows the random effects distribution to be unknown and clusters the individuals into groups defined by their random effect values (Dunson,

2010).

While the Dirichelet process prior allows P to have an unknown distribution, P is necessarily discrete, with ties in the random effect values. It may be more realistic to cluster subjects into groups with similar but not identical random effect values. This can be accomplished by characterizing the random effects distribution as a Dirichlet process mixture of normal distributions (Lo, 1984; Escobar and West, 1995; Muller et al., 1996;

Ohlssen et al.,2007).

In addition, Dirichlet process mixture priors have been incorporated into joint models for longitudinal and time to event data to add flexibility into the random effects distri-bution of the longitudinal process (Brown and Ibrahim, 2003; Linero and Daniels, 2015;

Rizopoulos and Ghosh, 2011; Tang et al.,2014). However, the survival or time to event

(42)

portion of these models often still depends on parametric or proportional hazards assump-tions. Dirichlet process mixtures have also been used to flexibly model time to event data (Ibrahim et al., 2004; Johnson and Christensen, 1986; Christensen and Johnson, 1988). Our method to account for dropout uses the Dirichlet process mixture approach for both the longitudinal and dropout time portions of the model, resulting in increased flexibility and relaxing assumptions about both the distribution of subject-specific effects and the distribution of dropout times. Our method clusters subjects into groups based on both their random effects values and dropout times. While as it is presented, our method assumes dropout times are known exactly, it could be readily extended to account for ad-ministrative censoring and covariates using methods described by De Iorio et al (De Iorio et al., 2009). We first describe the model for longitudinal exponential family outcomes with subject-specific effects, and then explain how to obtain dropout time specific and marginal effects, integrated over the distribution of dropout times.

III.3.1 Model

For m subjects with ni observations, let yi = (yi1, . . . , yini) be the vector of outcomes for subject i observed at times ti = (ti1, . . . , tini). Assume each subject has a dropout time ui. We consider an exponential family framework, such that:

y_i_|η_i_{, φ ∼ EF(η}_i, φ) (III.1)

η_i = 1iβ0i+ tiβ1i+ CiβC (III.2)

where EF is an exponential family distribution such that

f (yij|ηij) = exp [(yijηij − b(ηij))/φ] c(yij, φ) (III.3)

µij = E[yij|ηij] = b′(ηij) (III.4)

(43)

where g() is the link function, ηij is the linear predictor, and φ is a scale parameter. 1i

is an ni x 1 vector of 1’s and Ci is the design matrix for covariate effects βC, which

are common across subjects. In our semi-parametric Dirichlet process mixture model for dropout, the subject-specific coefficients (β0i, β1i) and log(ui) are flexibly modeled as a

mixture of an unknown, potentially infinite, number of normal distributions: " _β 0i β1i log(ui) # ∼ Z N(µ, Σ)dG(µ) (III.6) G|α, µ0, Σ0 ∼ DP(α, N(µ0, Σ0)) (III.7)

The mixing distribution G has a Dirichlet process prior, with concentration parameter α and a normal baseline distribution with mean, µ₀, and variance, Σ0. An equivalent

model, based on the stick breaking formulation of the Dirichlet process (Sethuraman,

1994) can be formulated by replacing equations III.6 and III.7 with:

G = ∞ X k=1 pkN(µk, Σ) (III.8) " _β 0i β1i log(ui) # |ci = k ∼ N(µk, Σ) (III.9) pk = Pr(ci = k) = Vk k−1 Y l=1 (1 − Vl) (III.10) Vk ∼ Beta(1, α) (III.11) µ_k_|µ₀, Σ0 ∼ N(µ0, Σ0) (III.12)

where ci indicates to which cluster or normal distribution subject i belongs, {pk}∞k=1 are

the probabilities of belonging to cluster k and {Vk}∞k=1 are the stick breaking weights

or conditional probabilities of belonging to cluster k given that the subject was not in clusters 1 to k−1. µk = (µ0k, µ1k, µuk)′ is the mean for cluster k and Σ =

"_σ2

0 σ01 σ0u

σ01 σ12 σ1u

σ0u σ1u σu2

#

is a common covariance matrix for all clusters. This model could easily be extended to

(44)

allow for cluster-specific covariances, Σk, although it may be difficult to estimate these

parameters in clusters with small numbers of subjects. Using this formulation, it is straightforward to determine the conditional distributions for a Gibbs sampler.

III.3.2 Priors and Hyperpriors The following priors are assigned:

β_C _{∼ N(0, R}0) (III.13)

Σ_|ν₀_{, T}₀ _{∼ IW(ν}₀_{, T}₀₎ _(III.14) µ₀_|mb, Sb ∼ N(mb, Sb) (III.15)

Σ₀_|ν_b_{, T}_b _{∼ IW(ν}_b_{, T}_b₎ _(III.16)

α ∼ Gamma(a0, b0) (III.17)

For normally distributed outcomes, the withsubject error variance is assigned an in-verse gamma prior, σ2

ǫ|τ1, τ2 ∼ IG(τ1, τ2).

III.3.3 Estimation

The posterior distribution is not analytically tractable and must be approximated. Markov chain Monte Carlo (MCMC) is a common approach (Ishwaran and James,2001;

MacEachern, 1994; Jain and Neal, 2004), although other alternatives, including par-tial predictive recursion (Newton and Zhang, 1999), variational Bayes approximations (Blei and Jordan,2006), sequential importance sampling (MacEachern et al.,1999), and weighted Chinese restaurant sampling (Ishwaran and Takahara, 2002) have been pro-posed. A blocked Gibbs sampler for non-parametric random effects distributions has been described in detail by Ishwaran and James (Ishwaran and James, 2001). The method approximates the posterior using a truncation of the stick breaking model of equations III.8-III.12. Since the probability weights assigned to each cluster decrease as the number