
Applications of zero-heavy finite mixture models to identification of exacerbation-prone patients in COPDGene


APPLICATIONS OF ZERO-HEAVY FINITE MIXTURE MODELS TO IDENTIFICATION OF EXACERBATION-PRONE PATIENTS IN COPDGENE

by

Melissa E Lowe

B.A. Wesleyan University, 2017

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment

of the requirements for the degree of Master of Science

Biostatistics Program 2020


This thesis for the Master’s degree by Melissa E. Lowe

has been approved for the Biostatistics Program

by

Elizabeth Juarez-Colunga, Advisor


Lowe, Melissa E. (MS, Biostatistics)

Applications of Zero-Heavy Finite Mixture Models to Identification of Exacerbation-Prone Patients in COPDGene

Thesis directed by Associate Professor Elizabeth Juarez-Colunga, PhD.

ABSTRACT

Identifying latent subgroups in diseases with heterogeneous presentations can be difficult. It is clinically hypothesized that there are multiple types of patients prone to severe exacerbation in Chronic Obstructive Pulmonary Disease. Few studies have considered the longitudinal trajectories of such patients. We applied longitudinal finite mixture models, with several potential mixture structures, to data from the Longitudinal Follow-up Survey of the COPDGene study to identify subgroups of exacerbators. Results from the hurdle model component indicated that baseline spirometric ratios, self-identified race, prior exacerbation history, worse dyspnea, and non-smoking status at baseline were associated with increased probability of at least one exacerbation. Results from the applied Poisson mixture models indicated that total exacerbation count accounts for part of the distinction between groups of exacerbators, but varying frequency of exacerbation occurrence appears to dominate any identifiable subgroups.

Key Words: prospective cohort; longitudinal count data; zero-inflated models; latent-class models; hurdle models; Poisson regression; COPD.

The form and content of this abstract are approved. I recommend its publication. Approved: Elizabeth Juarez-Colunga


Acknowledgements

Thank you to Dr. Elizabeth Juarez-Colunga for her guidance and for advising this thesis, and to my committee members for providing feedback throughout this process. Thank you to the COPDGene investigators for allowing me to use the data from their study.

I also need to acknowledge the invaluable resource of Angela Moss’s thesis work published in her 2015 paper(31). It informed the development of my own models as well as the structure for testing and fitting said models.

Additionally, I wanted to express my deep gratitude to Dr. James Crooks for saving me with helpful tips on how to work with JAGS and R as well as being an excellent boss and mentor.

Finally, thank you to the Department of Biostatistics for funding my graduate education and providing me with the tools and knowledge to complete this thesis.


Specific Aims

AIM 1:

To summarize the baseline characteristics from the COPDGene Phase 1 cross-sectional data and evaluate key longitudinal events gathered in the Longitudinal Follow-Up Survey from subjects involved in the study, including immediate drop-outs and subjects followed for more than 100 months.

AIM 2:

To develop a method of identifying subgroups of disease using simulation studies, where characteristics of such subgroups may predict acute respiratory exacerbations. We hypothesize that finite mixture models provide a useful avenue of classification and apply simulation studies to evaluate these models for unbalanced longitudinal count data.

AIM 3:

To apply methodology for finite mixture models to data from the COPDGene study, which has a rich multi-year record of exacerbation counts in a heterogeneous cohort of 10,000 patients. We apply longitudinal mixture models to data from COPDGene and compare the results to those from the simulation studies.


TABLE OF CONTENTS

Chapter 1: Introduction
  1.1 COPD and Exacerbations
  1.2 Statistical Literature Review
Chapter 2: Descriptives
  2.1 Cohort Description by Exacerbation Count
Chapter 3: Methods and Model Development
  3.1 Hurdle Model
  3.2 Mixture Models
  3.3 Bayesian Methods and Prior Selection
    3.3.1 Priors
  3.4 Model Convergence and Selection
  3.5 Simulations
Chapter 4: Simulations
  4.1 Model 1
  4.2 Model 2
  4.3 Model 3
  4.4 Model 4
  4.5 Model 5
  4.6 Null Results
Chapter 5: Applications and Results
  5.1 Hurdle Models and COPDGene
  5.2 Finite Mixture Models and COPDGene
    5.2.1 Model 1
    5.2.2 Model 2
    5.2.3 Model 3
    5.2.4 Model 4
    5.2.5 Model 5
Chapter 6: Conclusion
  6.1 Discussion


List of Figures

1. Random sample of the follow-up trajectories of 60 COPDGene patients. Each dot represents an attempted contact. Pink dots represent a failed contact, darker blue dots represent one or more exacerbations, and black dots represent a death. Subjects are ordered by follow-up time in months, where 0 is their Phase 1 initial visit.
2. Graphs showing cumulative distribution functions (top) and probability mass functions (bottom) of a Poisson with λ = µ = 5 (left) and λ = µ = 10 (right) in red, and several negative binomials with precision parameters varying between 0.1 and 50.
3. Distribution of exacerbation rates in the overall cohort, including those without exacerbations during the follow-up period. The y-axis is on the log scale. The maximum recorded exacerbation rate is 0.78; the 25th percentile is 0, the 75th is 0.01, the 95th is 0.06, and the 99th is 0.15.
4. Follow-up history of subjects in different quantiles of exacerbation rates. Top left: random sample from the full cohort. Top right: random sample from the 25th-75th quantiles (0 to 0.01 events per month). Bottom left: random sample from the 75th-95th quantiles (0.01 to 0.06 events per month). Bottom right: random sample from the top 5% of exacerbation rates (0.06 to 0.78 events per month).
5. Density functions from the 200 data sets run in Simulation 1. Top left: "a"; top right: "c"; bottom left: "sigma"; bottom right: "m". Red lines indicate truth.
6. Density functions from the 200 data sets run in Simulation 2. Top left: "a"; top right: "c"; bottom left: "sigma"; bottom right: "m". Red lines indicate truth.
7. Density functions from the 200 data sets run in Simulation 3. Top left: "a"; top right: "c"; bottom left: "sigma.1"; bottom right: "sigma.2". Red lines indicate truth.
8. Density functions from the 177 data sets run in Simulation 4. Top left: "a"; top right: "M1"; bottom left: "M2"; bottom right: "sigma". Red lines indicate truth.
9. Density functions from the 200 data sets run in Simulation 5. Top left: "M1"; top right: "M2"; bottom left: "Sigma 1"; bottom right: "Sigma 2". Red lines indicate truth.
10. Trajectories of 60 randomly sampled subjects identified as belonging to Group 1 based on Model 3 outcomes. Dots represent a follow-up interval; darker blue dots represent more than one exacerbation in the interval; pink dots are missing intervals; black dots are deaths.
11. Trajectories of 60 randomly sampled subjects identified as belonging to Group 2 based on Model 3 outcomes. (Legend as in Figure 10.)
12. Trajectories of 60 randomly sampled subjects identified as belonging to Group 1 based on Model 5 outcomes. (Legend as in Figure 10.)
13. Trajectories of 60 randomly sampled subjects identified as belonging to Group 2 based on Model 5 outcomes. (Legend as in Figure 10.)
14. Frequency of exacerbations in subjects who died of pulmonary-related causes. Pi10 is a measure of airway wall thickness; higher values indicate worse disease. Dyspnea is a measure of breathlessness; values over 2 indicate serious disease. SGRQ is the St. George's Respiratory Questionnaire, which measures quality of life; higher values are associated with lower quality of life. GERD: gastroesophageal reflux disease. CVD: cardiovascular diseases.

List of Equations

1.1 Zero-Inflated Poisson with weights
1.2 Poisson Hurdle Model
1.3 Longitudinal Hurdle Model
1.4 Poisson Link Functions
1.5 Zero-Inflated Poisson
1.6 Negative Binomial Distribution
1.7 Finite Count Mixtures
1.8 Posterior Negative Binomial Mixture Distribution
1.10 Accounting for exposure in a counting process structure
3.1 Poisson Hurdle Model, modified to describe the case under COPDGene
3.2 Finite Longitudinal Count Mixtures, modified for application to COPDGene
3.5 General Model Link Function
3.6 Bayes Theorem
4.2 Model 2 Link Function
4.3 Model 3 Link Function
4.4 Model 4 Link Function
4.5 Model 5 Link Function
5.1 Model 1 Link Function, adjusted for exposure
5.2 Model 2 Link Function, adjusted for exposure
5.3 Model 3 Link Function, adjusted for exposure
5.4 Model 4 Link Function, adjusted for exposure
5.5 Model 5 Link Function, adjusted for exposure
6.1 Example survival function with constant hazard
6.2 Example survival function with non-constant hazard
6.3 Log-likelihood for Poisson cloglog model with 0-inflation
6.4 Likelihood of Mixture Model conditional on Dropout Time


1 Introduction

Identification of patients with differential risk of recurrent exacerbation or progression in Chronic Obstructive Pulmonary Disease (COPD) is an active research area. Techniques such as PCA, factor analysis, k-means clustering, and deep belief networks have been utilized to characterize the heterogeneity of clinical presentations in COPD patients and to develop subtypes of disease (1)(2)(3)(4). Mixture models for latent class identification provide a further avenue for interpreting heterogeneity. Mixture models are flexible in that they can model longitudinal count data with unbalanced study designs while allowing for covariates in situations where the number of potential components (subgroups) is unknown(5).

The COPDGene study was established to identify genetic risk factors and subtypes of disease in a longitudinal prospective cohort study of approximately 10,000 smokers(6). Study participants have been followed since 2010, with clinical visits scheduled every five years. Further follow-up has included a longitudinal survey every six months to contact the participants regarding changes to their health, including the number of exacerbations experienced since their last contact with the study (7). It is hypothesized that patients with more frequent severe exacerbations have distinct characteristics from those with no or infrequent severe exacerbations. The extensive data from the longitudinal follow-up (LFU) survey provides the opportunity to explore methodologies for identifying subtypes of disease related to exacerbations occurring in subjects over time.

This paper evaluates mixture models with random effects for longitudinal components as an approach to analyze zero-heavy panel data. In order to estimate posterior distributions of the parameters of interest in these models, we apply Markov Chain Monte Carlo methods of random sampling in a probabilistic space. This paper is organized as follows: Section 1 is the introduction and provides an overview of relevant literature relating to exacerbations in COPD and the primary methods of interest. Section 2 presents descriptive statistics and


summaries of the motivating data set, the COPDGene study. In Section 3, we describe the methods and specify the models. Section 4 reviews the simulation studies performed, demonstrating that the proposed method works for subgroup identification in longitudinal count data. Section 5 reviews applications of the mixture models to the motivating COPDGene data and discusses the results. Finally, in Section 6, we conclude this paper with directions for further work.

1.1 COPD and Exacerbations

Acute exacerbations in chronic lung diseases like COPD are associated with poor quality of life, loss of lung function, and increased mortality(8). A structure to identify patients with a higher risk of exacerbation at earlier stages of their disease could help health care practitioners provide earlier preventative measures. Prior research has attempted to utilize computed tomographic (CT) phenotypes to find associations with subjects reporting worsening disease states in the 12 months prior to the first COPDGene study visit (9). Other authors have applied deep belief networks to classify exacerbation frequency, finding that airway wall area, current smoking, BMI, and respiratory infections increase exacerbation risk (2). Dransfield et al. (2017) utilized data from the LFU surveys to fit linear mixed models showing associations of exacerbations with increased loss of forced expiratory volume (FEV1)(10). An earlier article applied multinomial logistic regression to data from the ECLIPSE cohort, separating exacerbation frequency into none, one, and two or more exacerbations, with the finding that patients with more exacerbations had worse FEV1, decreased quality of life, histories of gastroesophageal reflux, and increased white blood cell counts(11). An additional analysis from the ECLIPSE cohort, looking at classification of patients as frequent or infrequent exacerbators based on exacerbation history, found no


parameters that predicted transition between groups in later observation periods(12). Few, if any, studies have evaluated recurrent exacerbations in COPD or attempted to categorize patients based on longitudinal observation of such events.

The longitudinal follow-up survey from the COPDGene study represents a valuable source of insight about how exacerbations occur in COPD. Figure 1 illustrates the substantial heterogeneity of trajectories in COPDGene. Subjects are contacted at approximately six-month intervals and asked "Since we last talked on (month, year) have you had an episode of increased cough and phlegm or shortness of breath, which lasted 48 hours or more?". If yes, they are further asked whether they went to the emergency room or were hospitalized, and how many times such events occurred. Time points indicated in pink are those where subjects responded to questions in the call but either did not complete the questionnaire or answered "do not know". At the next successful contact, the subject reports how many exacerbations occurred since the failed contact (pink).


Figure 1: Random sample of the follow-up trajectories of 60 COPDGene patients. Each dot represents an attempted contact. Pink dots represent a failed contact. The darker blue dots represent one or more exacerbations and black dots represent a death. Subjects are ordered by follow-up time in months where 0 is their Phase 1 initial visit.


1.2 Statistical Literature Review

In a random sample of 60 COPDGene patients, we can see that exacerbations are relatively rare, but some subjects experience multiple severe exacerbations in a short amount of time. We hypothesize that, if subgroups can be identified, there will be useful clinical characteristics defining patients in one group over another. The LFU data is zero-heavy, unbalanced longitudinal count data. Therefore, appropriate analysis of the data demanded a review of models used for count data, of how to manage excess zeros in count data, and of how to apply mixture models to unbalanced longitudinal data.

Typically, count data is modeled using the Poisson distribution or the negative binomial distribution. The Poisson distribution is defined by one parameter, λ, which is equal to both the mean and variance of a discrete random variable Y (13):

f(y) = λ^y e^(−λ) / y!,   λ = e^(X′β)

where X is the covariate matrix and β is the vector of coefficients. The Poisson distribution may be used to model the number of times an event (like an exacerbation) occurs within an interval. The negative binomial is distinct from the Poisson in that it has a longer, fatter tail and allows the variance to differ from the mean. It is defined by two parameters, ψ and µ, where ψ is the precision parameter and µ = e^(X′β), with X and β as above. It is essentially an extension of the Poisson distribution: it also models the number of events (e.g., exacerbations per interval) but allows more flexibility than the Poisson(14). Figure 2 shows the differences between a Poisson with a single parameter and the negative binomial with two parameters. This study evaluates the Poisson as the primary distribution of interest but recognizes that the negative binomial is a reasonable distribution for panel count data, where repeated observations are obtained over multiple intervals for the same subjects.
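These mean-variance claims are easy to verify numerically (shown here in Python with scipy for illustration; the thesis models themselves were fit in R and JAGS). scipy's nbinom(n, p) with n = ψ and p = ψ/(ψ + µ) matches the (µ, ψ) parameterization above; the parameter values below are arbitrary:

```python
import numpy as np
from scipy import stats

mu, psi = 5.0, 2.0  # arbitrary shared mean and NegBin precision

pois = stats.poisson(mu)
# nbinom(n, p) with n = psi, p = psi / (psi + mu) has mean mu
# and variance mu + mu**2 / psi, which exceeds the Poisson variance
nb = stats.nbinom(psi, psi / (psi + mu))

print(pois.mean(), pois.var())  # Poisson: variance equals the mean
print(nb.mean(), nb.var())      # NegBin: variance mu + mu**2/psi
print(pois.pmf(0), nb.pmf(0))   # the NegBin places more mass at zero
```

With µ = 5 and ψ = 2, both distributions have mean 5, but the negative binomial's variance is 17.5 and its zero mass is an order of magnitude larger, illustrating why it suits overdispersed, zero-heavy counts.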


Figure 2: Graphs showing cumulative distribution functions (top) and probability mass functions (bottom) of a Poisson λ = µ = 5 on the left and λ = µ = 10 on the right in red and several negative binomials with precision parameters varying between 0.1 and 50


For the zero-heavy component, the primary issue under consideration was how to manage zero-inflation, distinguishing structural zeros, which come from a distinct distribution separate from the count values, from sampling zeros, which are inherent in most discrete distributions. Ridout et al. 1998 wrote a comprehensive review of possible zero-heavy models. Mixed Poisson distributions, zero-modified distributions, and hurdle models were considered for possible relevance to our motivating dataset. Ridout et al. also helpfully provided plant-based examples for model interpretation, which are included here for ease of understanding.

• Mixed Poisson Distributions are described as applicable in a situation where cuttings taken from a plant may vary in two relevant ways: their basal diameter and their position on the plant. In such cases, mixed Poisson distributions are useful because they have a parameter that is allowed to vary from plant to plant such that λ = µQ, where Q is a random variable with E[Q] = 1 and V[Q] = a. Therefore, E[Y] = µ and V[Y] = µ + aµ². The probability of 0 due to the inclusion of the random variable is higher than it would be in an unmodified Poisson. Typically, negative binomial distributions are used to model overdispersed data, but other mixing distributions can allow for more than one mode (most frequent value). Mixed Poisson regression models include covariates in µ through a log-link generalized linear framework.

• Zero-Modified Distributions are used to model situations in which a portion of the sample is not at risk of an event or of having a non-zero outcome. The proportion in the zero-heavy group can also be negative, which allows for a zero-deflated distribution. For example, a certain proportion of cuttings from a plant are simply unable to root and therefore contribute structural zeros regardless of the Poisson parameter:

P(Y = 0) = w + (1 − w)e^(−λ)
P(Y = y) = (1 − w)e^(−λ)λ^y / y!,  y > 0    (1.1)

In this case, E[Y] = (1 − w)λ = µ and V[Y] = µ + (w/(1 − w))µ². This model can extend to a regression framework where X and Z are matrices of covariates while β and γ are vectors of parameters with scale τ: log(λ) = Xβ and log(w/(1 − w)) = τXβ, which generally implies that w = (1 + λ^(−τ))^(−1). The zero-inflated Poisson (ZIP) model can be further generalized to have two unobserved count variables Y* and D with marginal distributions Pois(λ + Z) and Pois(θ + Z), where Z is the covariance between Y* and D. The observed count Y is zero if either Y* or D is 0, and is equal to Y* otherwise. If Z = 0, the distribution of Y reduces to the simpler ZIP with w = e^(−θ).

• Hurdle Models apply where the data has potentially differing mechanisms for whether events occur and for how many occur. The example here would be separately considering the proportion of cuttings that successfully rooted and then the mean number of roots per successful cutting: the mechanisms that drive rooting are separate from those that determine the number of roots on established plants. There are two specifications in the hurdle model: the probability φ0 that an event does not occur, and a distribution for the number of events that occur if the so-called hurdle is cleared. The count distribution is frequently a truncated discrete structure such as a truncated Poisson or negative binomial, although a logarithmic distribution could also be used.

P(Y = 0) = φ0
P(Y = y) = (1 − φ0) e^(−λ)λ^y / ((1 − e^(−λ)) y!),  y > 0    (1.2)
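The contrast between the zero-modified form (Equation 1.1) and the hurdle form (Equation 1.2) can be made concrete with a small sketch; λ, w, and φ0 below are arbitrary illustrative values:

```python
import numpy as np
from scipy import stats

lam, w, phi0 = 2.0, 0.3, 0.3  # arbitrary rate, inflation weight, hurdle zero probability

def zip_pmf(y, lam, w):
    """Equation 1.1: structural zeros with weight w plus Poisson sampling zeros."""
    base = stats.poisson.pmf(y, lam)
    return np.where(y == 0, w + (1 - w) * base, (1 - w) * base)

def hurdle_pmf(y, lam, phi0):
    """Equation 1.2: point mass at zero, zero-truncated Poisson above the hurdle."""
    base = stats.poisson.pmf(y, lam) / (1 - np.exp(-lam))
    return np.where(y == 0, phi0, (1 - phi0) * base)

y = np.arange(50)
print(zip_pmf(y, lam, w).sum(), hurdle_pmf(y, lam, phi0).sum())  # both sum to ~1
# the ZIP zero mass mixes structural and sampling zeros; the hurdle zero mass is fixed at phi0
print(float(zip_pmf(0, lam, w)), float(hurdle_pmf(0, lam, phi0)))
```

Note that the hurdle's zero probability is exactly φ0, while the ZIP's is w + (1 − w)e^(−λ): with covariates the two parameterizations are no longer interchangeable, which is the distinction developed below.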

Reviews of methods that accommodate extra zeros have been written; in particular, Buu et al. 2012 compared and contrasted hurdle and zero-inflated Poisson (ZIP) models. Their overall findings were that the hurdle model works best under their conceptual framework (substance abuse) but is not as effective as the ZIP when random effects are included. The study was motivated by data from substance abuse disorders, where subjects frequently


relapse and have longitudinal trajectories complicated further by the frequent report of 0 symptoms resulting in 0-heavy data. The authors applied the models and simulations to data from the Michigan Longitudinal Study (MLS), a prospective cohort on alcohol use disorder symptom counts. Their work evaluated how the longitudinal aspects of the data changed the fit of the two model structures, the relative computational time, and the model performance under differing conditions of sample size, missingness, and levels of correlation.

Recall that hurdle models are essentially a two-stage decision structure: whether an action occurs at all, and then to what extent it occurs once begun. Hurdle models can simultaneously accommodate two different sets of factors for the two stages. We extend Equation 1.2 to include indicators for multiple events within each subject.

Y_ij = 0 with probability φ_ij
Y_ij ∼ Truncated Poisson(λ_ij) with probability 1 − φ_ij    (1.3)

where i and j denote subject and event. A subject can either be identified as a subject who has developed symptoms, modeled by the truncated Poisson, or as a subject who is symptom free:

Symptom: log(λ_ij) = x′_ij β + α_i
No symptom: log(φ_ij / (1 − φ_ij)) = z′_ij γ + b_i    (1.4)

Here, α_i and b_i are random effects accounting for within-subject correlation and between-subject heterogeneity.
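A data-generating sketch of Equations 1.3-1.4 may make the structure concrete (Python for illustration; the sample sizes, effect sizes, and random-effect variances below are all invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_subj, n_obs = 200, 6
beta, gamma = 0.4, -0.8                 # invented fixed effects

x = rng.normal(size=(n_subj, n_obs))
alpha = rng.normal(0, 0.5, n_subj)      # alpha_i: random effect, count part
b = rng.normal(0, 0.5, n_subj)          # b_i: random effect, zero part

# Equation 1.4: logit link for the zero state, log link for the count mean
phi = 1 / (1 + np.exp(-(gamma * x + b[:, None])))   # P(Y_ij = 0)
lam = np.exp(0.2 + beta * x + alpha[:, None])

# Equation 1.3: draw the zero indicator, then a zero-truncated Poisson count
is_zero = rng.uniform(size=x.shape) < phi
y = np.zeros(x.shape, dtype=int)
idx = ~is_zero
# zero-truncated Poisson via the inverse CDF restricted to (P(Y=0), 1)
u = rng.uniform(np.exp(-lam[idx]), 1.0)
y[idx] = stats.poisson.ppf(u, lam[idx]).astype(int)

print(y.shape, round((y == 0).mean(), 2))  # a zero-heavy panel of counts
```

Because the truncated Poisson only generates counts of 1 or more, every zero in the panel comes from the logit component, which is exactly the separation the two link functions in Equation 1.4 exploit.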

ZIP models are still finite mixture models. They were originally used in manufacturing to model the number of defects in a process moving randomly between perfect and imperfect states. We see below that the ZIP model varies very little from the hurdle model, except that it can tolerate sampling zeros alongside structural zeros instead of simply separating out the structural zeros:

Y_ij = 0 with probability φ_ij
Y_ij ∼ Poisson(λ_ij) with probability 1 − φ_ij    (1.5)

where i and j denote subject and event.

The hurdle model and the ZIP model with no covariates are simply reparameterizations of each other and are mathematically equivalent. In general, with included covariates, the ZIP model is better for populations that have a group at risk and a group not at risk. Meanwhile, the hurdle model is most useful when the whole population is considered at risk and the occurrence of an event represents a hurdle being crossed. For example, Buu et al. 2012 presents ZIP models as being appropriate in comparing drinkers and non-drinkers. Non-drinkers have 0 symptoms while drinkers may or may not have 0 symptoms. In a hurdle structure, everyone is potentially at risk of alcohol dependence but only some people have at least one symptom. The hurdle structure is flexible enough to allow for both zero-inflated and zero-deflated values.

Zero-inflated models should also consider a longitudinal aspect where subjects are at different levels of risk over time. The hurdle model allows greater flexibility for these changing states. In the ZIP model, the components are fit simultaneously, so maximization is more complex. In the hurdle model, the two terms can be maximized separately, provided that random effects linking the two components of the model are not included.

Buu et al. applied a simulation study to test the appropriateness of the above models to generated data using the hurdle model to represent the structure of data from the MLS. The five covariates included were generated from a multivariate normal. They allowed for three different levels of correlation and three levels of sample sizes. In order to produce randomness within subjects to mimic different assessment schedules they randomly generated a covariate set at 20 predetermined waves with random shift applied. They also had three levels of missingness proportions structured as MCAR (missing completely at random unrelated to the outcome and any covariates). Each of the 9 simulations had 200 replications. Within


each simulation, they fit a Poisson regression, a ZIP, and a hurdle model, with fixed and random effects for each. The Poisson was used to evaluate the role of ignoring missingness. The mean squared error was used to summarize the deviation of the 200 estimates from the true parameters in the fixed and random effects.

They found that model performance improves with increased sample size. At fixed levels of the three modulating factors (sample size, correlation, and missingness), the hurdle model performs better, especially in the zero component. The Poisson model without zero-inflation performs the worst. All models perform increasingly poorly as missingness increases. If correlation within subjects increases, the hurdle model performs worse, but the Poisson and the ZIP show no clear patterns as correlation changes. The hurdle model was also found to have faster computational time than the zero-inflated Poisson model(16). ZIP and hurdle models represent a structure for finite mixture models in the case of zero-heavy count data. The clinical studies of COPD discussed previously suggest that there may be further subgroups among subjects who have at least one severe exacerbation event. Therefore, we wanted to apply the longitudinal ZIP and/or hurdle models to mixture regression models for panel count data in the COPDGene study. Kurz and Hatfield 2019 provided a concise summary of latent group analysis in count mixture regression models, considering both computational structures and computational efficiency(17).

Kurz and Hatfield (2019) compared likelihood-based and parametric Bayesian mixtures to identify latent subgroups using negative binomials and zero-inflated negative binomials. Their motivating data was non-negative and right-skewed with heavy tails, and was also characterized by multi-modality and an abundance of zeros. The authors combined the properties of generalized linear models with zero-inflation techniques and mixtures for multi-modal flexibility and over-dispersion.

Their goal was to apply count mixture models to identify subgroups of patients with differing patterns of inpatient stay. They evaluated the usefulness of the models by their ability to estimate the mixture components, the mixture parameters, and the clusters identified from the components. Specifically, they compared maximum likelihood-based finite mixture models with Bayesian parametric mixture models, looking at the relative detection of the number of mixture components and parameter estimation. In Kurz and Hatfield (2019), components refer to the individual distributions in the mixture, while clusters refer to the observations drawn from each mixture component.

Previous statistical literature had mostly considered cases where the Poisson distribution was used to model count data; Kurz & Hatfield (2019) consider the negative binomial:

f(y|µ, ψ) = [Γ(y + ψ) / (Γ(ψ)Γ(y + 1))] (ψ/(µ + ψ))^ψ (µ/(µ + ψ))^y    (1.6)

They extended this to a zero-inflated negative binomial, which allows for two sources of zeros: a point mass at 0 (structural) and zeros arising from the negative binomial distribution itself (sampling). Using a canonical log-link, the regression model has mean

E(y|x) = (1 − φ) exp(X′β)

Further, the authors summarized finite mixture models of count distributions. The distribution of the outcome accounting for mixtures is a weighted sum of the negative binomial mixture components:

f(y|w, x, β, ψ) = Σ_{k=1}^{K} w_k NegBin(y | exp(X′β_k), ψ_k)    (1.7)

Each component is parameterized by µ_k = exp(X′β_k) and the precision parameter ψ_k. The contribution of each mixture component is determined by the weight w_k, with w_k ∈ [0, 1] and Σ_{k=1}^{K} w_k = 1. The model is extended to the case of zero-inflation by adding φ_k, as well as subscripts n to identify individual observations of the outcome and covariates. Finally, the observed data have cluster membership indicators z_nk, which equal 1 if the observation is drawn from component k and 0 otherwise, such that Σ_{k=1}^{K} z_nk = 1.


In order to manage the missing data problem of group membership in their consideration of finite mixture models, they applied an expectation-maximization (EM) algorithm, which iterates between the expectation of the complete-data log-likelihood with respect to the conditional distribution of the missing component indicators and the maximization of that expected log-likelihood in the parameters. The EM algorithm produces posterior probabilities for each observation:

P(Z_nk = 1 | β, ψ, w, x) = w_k NegBin(y_n | exp(X′β_k), ψ_k) / Σ_{j=1}^{K} w_j NegBin(y_n | exp(X′β_j), ψ_j)    (1.8)

These posterior probabilities are used to classify each observation into a cluster based on the component for which it has the highest posterior probability of membership, or to weight observations by their posterior probabilities of being in each cluster. This yields summaries of the clusters of observations. The choice of the number of components K was based on the AIC and BIC of models fit with various values of K.
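Equation 1.8's posterior membership probabilities drive the EM iteration; the sketch below substitutes Poisson components for the negative binomial to keep the example short (the data and starting values are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# two latent clusters of subjects with low and high event rates
y = np.concatenate([rng.poisson(1.0, 300), rng.poisson(6.0, 100)])

w = np.array([0.5, 0.5])    # mixture weights
lam = np.array([0.5, 3.0])  # starting component means

for _ in range(50):
    # E-step: responsibilities, the Poisson analogue of Equation 1.8
    dens = w * stats.poisson.pmf(y[:, None], lam)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood
    w = resp.mean(axis=0)
    lam = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)

print(np.round(lam, 2), np.round(w, 2))  # component means near 1 and 6
```

Hard classification then assigns each observation to the component with the larger row of `resp`, exactly the rule described above.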

The authors additionally evaluated Bayesian mixture models, which provide the ability to place a prior distribution on the indicators of component membership z_nk. The priors chosen are commonly multinomial with a conjugate Dirichlet prior on the mixture probabilities:

p(z|w) ∝ Π_{n=1}^{N} Π_{k=1}^{K} w_k^(z_nk)  and  p(w) = Dir(w|a0) = [Γ(Ka0)/Γ(a0)^K] Π_{k=1}^{K} w_k^(a0−1)    (1.9)

where Γ(·) is the gamma function and a0 is a hyperparameter. The maximum number of components K is set to 20. Complete model specifications include the priors on the regression coefficients and the precision parameters of the component regression models: β_kd ∼ N(m0, s0) and ψ_k ∼ logNormal(a0, b0), defined as weakly informative priors with a0 = 0.1, m0 = 0.0, and s0 = 10. As in the frequentist finite mixture models, the Bayesian mixture models produce posterior estimates of the frequency of each component as well as posterior probabilities of membership in each mixture component for each observation.


Because the Stan sampler does not support discrete latent variables, they marginalized over the component assignment variables.

The simulation studies run by Kurz and Hatfield involved the generation of data from mixtures with 2, 3, 4, and 5 components with varying amounts of overlap. The first three covariates in the models were generated from N(0, 0.5). The intercepts, regression coefficients, and the dispersion parameters ψ_k were used to produce high, medium, and low overlaps of the distributions. Finally, the outcome was generated from a negative binomial: y ∼ NegBin(exp(X′β_k), ψ_k). Generally, they found that the Bayesian methods had more precise estimates than the finite mixture models fit with EM algorithms(17).
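A generative scheme of this kind can be sketched as follows. This is our own illustrative code, not Kurz and Hatfield's: we read N(0, 0.5) as a normal with standard deviation 0.5, and we draw the negative binomial outcome through its gamma-Poisson representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mixture(n, weights, betas, psis):
    """Simulate covariates, component labels, and NegBin outcomes for a
    finite mixture of negative binomial regressions (illustrative sketch)."""
    p = len(betas[0]) - 1
    # intercept column plus covariates drawn from N(0, sd = 0.5)
    X = np.column_stack([np.ones(n), rng.normal(0.0, 0.5, (n, p))])
    z = rng.choice(len(weights), size=n, p=weights)   # component labels
    mu = np.exp(np.einsum("ij,ij->i", X, np.asarray(betas)[z]))
    psi = np.asarray(psis)[z]
    # NegBin(mu, psi) as a gamma-Poisson mixture:
    # y | lam ~ Poisson(lam), lam ~ Gamma(shape = psi, scale = mu / psi)
    lam = rng.gamma(shape=psi, scale=mu / psi)
    return X, z, rng.poisson(lam)
```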

Recognizing the potential influence of variable follow-up time on model inference, we also reviewed Baetschmann et al. 2013.

Baetschmann et al. evaluated the analysis of zero-inflated count data when the time of exposure varies, which is an important consideration for data sets like COPDGene where patient follow-up time varies. They propose a modified count data model where the probability of excess zeroes is derived from a duration model with a Weibull hazard rate(20). They compare their model to the standard Poisson model with logit zero inflation seen in the work of Buu et al. 2012.

For mixture models of longitudinal count data where exposure time varies but there are no excess zeroes, differences in T_i are managed by including log(T_i) in the regressors. If δ = 0, there are no exposure effects(20).

λ(x_i, T_i) = exp(x_i′β + δ log T_i)    (1.10)
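The role of the offset in Eq. 1.10 can be verified numerically: with δ = 1 the log T_i term acts as a classic exposure offset, making the expected count proportional to follow-up time. The coefficient values below are our own illustrative choices.

```python
import numpy as np

# lambda(x, T) = exp(x'beta + delta * log T); with delta = 1 this is the
# standard exposure offset: doubling follow-up time doubles the expected count.
beta = np.array([0.2, -0.5])
x = np.array([1.0, 2.0])   # includes the intercept term
delta = 1.0

def expected_count(T):
    return np.exp(x @ beta + delta * np.log(T))

# exp(x'beta + log T) = T * exp(x'beta), so the rate scales linearly in T
assert np.isclose(expected_count(24.0), 2.0 * expected_count(12.0))
```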

To better understand how the findings of our literature review could apply to our modeling, we next summarize and describe the baseline characteristics of our studied cohort as well as the longitudinal follow-up data.


2 Descriptives

The longitudinal exacerbation data from the COPDGene study presents several complicating factors. It is extremely zero-heavy, with nearly 75% of subjects not experiencing severe exacerbation events over their full follow-up time. Additionally, variations in follow-up intervals and time followed limit comparability between subjects. Finally, mortality in the cohort was subject to competing risks, with a large proportion having non-pulmonary or unadjudicated causes of death. See Figure 1 for a sampling of patient trajectories and events and Table 1 for descriptive characteristics of the population evaluated in this paper.

The COPDGene data provided several challenges for organization. We wanted to look at exacerbation patterns in the context of the important covariates measured at the study visits, including spirometry and some CT measures. Additionally, we were interested in the impact of exacerbations on the long-term outcome of mortality, which required combining three different data sets into one. Through this, we could begin to detect distinctions in the data. Here, we present several structures of summary statistics.

All subjects enrolled in COPDGene who did not have any longitudinal follow-up (LFU) survey contact were dropped (1405 subjects). It is unclear why these subjects did not have any follow-up, but we were unable to use them in analysis of exacerbations over time. Subjects who had lung transplants, bronchiectasis, or interstitial lung disease were considered to have met exclusionary criteria after enrollment and were also dropped.

Finally, some subjects did not start LFU until long after the initial study visit. If they were in the 95th percentile or above for time to first LFU contact, or if they had only a single LFU contact, they were removed from the cohort. These exclusions led to a final cohort of 7495 subjects with a total of 83,726 contact points.

The mean follow-up time was 92.26 months, with the least at 8.2 months and the most at 131 months. The average interval between LFU contacts was 8.3 months, with a minimum of 1.5 and a maximum of 52.5 months. The maximum number of total exacerbations (either hospitalizations or ER visits or both) was 33, and the vast majority of subjects, 5559, had no exacerbations over their follow-up time. For subjects with at least 1 exacerbation, 761 had only a single event over study time; 642 subjects had two or three; and 533 subjects had four or more exacerbations.

Almost half of the cohort was male (49%); see Table 1. Non-Hispanic whites made up 86% of the cohort. The mean age at baseline was 61.2. The cohort was slightly overweight, with average BMI at 29.05. In terms of smoking history and disease severity, mean pack years smoked was 44.1 and 36.5% were still smokers as of visit 1. Disease severity at baseline was distributed across the GOLD stages of COPD. A total of 1% of subjects were non-smoking controls, 12% were in the PRISM group, 40.5% were non-symptomatic smokers (GOLD 0), 8% were GOLD 1 with FEV1 at least 80% of normal, 20.3% were GOLD 2 with FEV1 between 50% and 80% of normal, and 12.25% were GOLD 3 with severe emphysema and FEV1 between 30% and 50% of normal. Finally, 6% were GOLD 4 with severe COPD either by FEV1 measures or low blood oxygen (21).


Characteristic      Structure   Value
n                               7495
Age                 mean (sd)   61.20 (8.91)
Gender (F)          count (%)   3783 (50.47)
Race (b)            count (%)   1800 (24.02)
BMI                 mean (sd)   29.05 (6.27)
Pack Years          mean (sd)   44.10 (25.38)
Current Smoker (y)  count (%)   2738 (36.54)
GOLD STAGES
  Never-Smoker      count (%)   98 (1.31)
  PRISM             count (%)   878 (11.76)
  0                 count (%)   3019 (40.45)
  1                 count (%)   604 (8.09)
  2                 count (%)   1514 (20.29)
  3                 count (%)   914 (12.25)
  4                 count (%)   436 (5.84)

Table 1: Characteristics of Cohort. GOLD stages correspond to disease severity indicated by spirometry values, specifically FEV1/FVC < 0.7 and declining FEV1, such that stage 4 has less than 30% predicted output while stage 1 has more than 80% predicted output.


2.1 Cohort Description by Exacerbation Count

Within the cohort, we wanted to explore the characteristics of subjects with differing numbers of exacerbations. Based on the quantiles of the non-zero exacerbation count, we reviewed subjects with 1, 2-3, and 4+ exacerbations to compare with those without any exacerbations.

As noted previously, the most common trajectory for subjects was having no exacerbations that involved hospitalization or an ER visit: 5,559 subjects had no exacerbations. Their mean pack years smoked was 42, and 38% of the non-exacerbating group were current smokers at the initial visit. The mean BMI was 29; the mean age was 61; 23% of the subjects were African American; and 49% were female. Unsurprisingly, 92 of 98 never-smokers were in this group. 12% of the non-exacerbators were in PRISM, 48% were from GOLD 0, 9% were from GOLD 1, GOLD 2 had 18.3%, and GOLD 3 and GOLD 4 together made up 10% of the non-exacerbators (Table 2).

In discussing disease severity, we consider four primary characteristics: spirometry, CT measures from analysis by Thirona (Nijmegen), symptom/function measures, and comorbid diseases. The non-exacerbating group had a mean FEV1 percent predicted of 81 and a mean FEV1/FVC ratio of 0.7. CT measures included a mean percent emphysema of 5.35, Pi10 of 2.2, and percent gas-trapping of 18.8. For disease symptoms and function level, the mean six-minute walk distance was 1432 ft, while the mean SGRQ (St. George's Respiratory Questionnaire) score was 21, where lower values indicate better health. The MMRC (modified Medical Research Council) dyspnea score was split at a cut-point of 2, where a score of 2 or greater indicated a significant increase in shortness of breath. The percent of the non-exacerbation group at or above this cut-point was almost 31% (Table 3).

The percentage of subjects reporting comorbid diseases included chronic bronchitis (15%), asthma (14%), diabetes (15%), GERD (gastroesophageal reflux) (26%), and cardiovascular disease (14%), the last defined as at least one of the following: angina, congestive heart failure, coronary artery disease, heart attack, or CABG (coronary artery bypass grafting) (Table 3).

761 subjects reported a single exacerbation over study time. 54% were female, with a mean age of 63; 22% reported African American ancestry; and the mean BMI was 29. 32% of the subjects in this group still smoked cigarettes at the first visit, with mean pack years of 46. Four never-smokers were in this group; PRISM made up 12.4%, GOLD 0 was 25%, GOLD 1 had 6.7%, GOLD 2 was 26.3%, and GOLD 3 and GOLD 4 combined represented almost 30% of subjects with a single exacerbation (Table 2).

Mean FEV1 percent predicted was 66%, while the mean FEV1/FVC ratio was 0.6. CT measures for subjects with single exacerbations included more emphysema (9.7%), higher gas-trapping (29%), and a mean Pi10 measure of 2.5. The mean six-minute walk distance decreased to 1305 ft. The mean total SGRQ score increased to almost 34, while the percent of subjects with significant shortness of breath (MMRC dyspnea) was almost 56%. Comorbid diseases tended to be slightly more common: 23% reported chronic bronchitis, 22% reported asthma, and 19% reported diabetes. Cardiovascular diseases were reported by 18%, and 35.5% of the group had GERD (Table 3).

642 subjects had more than one exacerbation but fewer than four. They again had a mean age around 62 and mean BMI at 29; 52.6% of the group were female and 30% were of African American descent. At the initial clinical visit, 36% were still smoking, and the mean number of pack years smoked was almost 50. Two of the never-smokers reported at least 2 exacerbations; 10% of this group were in PRISM, 20% were in GOLD 0, GOLD 1 was less than 6% of the group, GOLD 2 represented 26%, GOLD 3 had 25%, and finally GOLD 4 was slightly less than 13% (Table 2).

Mean spirometry measures were 61% predicted FEV1 and an FEV1/FVC ratio of 0.57. Mean distance walked in six minutes was 1202 ft, while the mean SGRQ score increased to 41.5. Almost 70% of the subjects had substantial difficulty with shortness of breath by the MMRC dyspnea score. Mean emphysema percent was 11%, gas-trapping percent was 33%, and the mean Pi10 measure was 2.6. The comorbid disease percentages were as follows: chronic bronchitis (29.4%), asthma (25%), diabetes (22%), GERD (37%), and CVD (cardiovascular disease) (23%) (Table 3).


Figure 3: Distribution of exacerbation rates in the overall cohort, including those without exacerbations during the follow-up period. The y-axis is on the log scale. Note that the maximum recorded exacerbation rate is 0.78 events per month; the 25th percentile is 0, the 75th is 0.01, the 95th is 0.06, and the 99th is 0.15.


The final group of 533 subjects with four or more exacerbations had a wide distribution of actual exacerbation counts (see Figure 4). Mean age was 61 and mean BMI was 29. 54% were female, almost 32% identified as African American, and 32% were still smoking at the first study visit, having smoked on average for 52 pack years. This group was dominated by GOLD 2 (25%), GOLD 3 (29%), and GOLD 4 (18%); PRISM was 10%, GOLD 0 was 13.5%, and GOLD 1 was almost 4% of the frequent exacerbators (Table 2).

They were sicker than the less frequent exacerbators. The six-minute walk mean was 1106 ft, with a mean SGRQ score of 48, and 75% had clinically significant dyspnea (MMRC). Mean emphysema increased to 12.8%, gas-trapping to 36.8%, and Pi10 to 2.7. Mean FEV1 percent predicted was slightly under 55%, and the FEV1/FVC ratio was around 0.53. In general, there was more cardiovascular disease (27%) and asthma (35%) but similar proportions of GERD (39%), diabetes (22%), and chronic bronchitis (30.4%) (Table 3).

When considering exacerbation rates across the follow-up time, the vast majority again had very infrequent exacerbations (see Figure 3). When we take a random sample from different quantiles of the distribution of exacerbation rates, we can see that the trajectories of subjects vary quite a lot (Figure 4). As the exacerbation rate increases above 0.01 (75th percentile and above), we see more clusters of events, with multiple severe exacerbations occurring within an interval (sometimes within as little as 5 or 6 months). Subjects may have gone a long period without exacerbations but then experience many in a short span of time, thus increasing their overall rate. In the 25th to 75th percentiles, exacerbations occur but are sporadic, never more than one in an interval.


Table 2: Characteristics of Cohort, by total exacerbation count

Characteristic      Structure   0 Exac.         1 Exac.         2-3 Exac.       4+ Exac.
n                               5559            761             642             533
Age                 mean (sd)   60.97 (8.90)    62.64 (8.6)     61.69 (9.31)    60.98 (8.70)
Gender (F)          count (%)   2749 (49.45)    407 (53.48)     338 (52.65)     289 (54.22)
Race (b)            count (%)   1278 (22.99)    165 (21.68)     188 (29.28)     169 (31.71)
BMI                 mean (sd)   28.96 (6.07)    29.17 (6.69)    29.16 (6.67)    29.57 (7.13)
Pack Years          mean (sd)   42.36 (24.87)   46.41 (23.97)   49.35 (26.63)   52.63 (28.11)
Current Smoker (y)  count (%)   2088 (37.56)    247 (32.5)      231 (35.98)     172 (32.27)
GOLD STAGES
  Never-Smoker      count (%)   92 (1.66)       4 (0.53)        2 (0.31)        0 (0)
  PRISM             count (%)   665 (12.01)     94 (12.37)      65 (10.24)      54 (10.15)
  0                 count (%)   2636 (47.62)    187 (24.61)     124 (19.53)     72 (13.53)
  1                 count (%)   497 (8.98)      51 (6.71)       36 (5.67)       20 (3.76)
  2                 count (%)   1012 (18.28)    200 (26.32)     168 (26.46)     134 (25.19)
  3                 count (%)   438 (7.91)      161 (21.18)     160 (25.2)      155 (29.14)

Table 3: Disease Severity, by total exacerbation count. Pi10 is a measure of airway wall thickness; higher values indicate worse disease. Dyspnea is a measure of breathlessness; values of 2 or more indicate serious disease. SGRQ is the St. George's Respiratory Questionnaire, which measures quality of life; higher values are associated with lower quality of life. GERD: gastroesophageal reflux. CVD: cardiovascular disease.

Characteristic             Structure   0 Exac.            1 Exac.            2-3 Exac.          4+ Exac.
n                                      5559               761                642                533
FEV1 %pred                 mean (sd)   80.84 (23.28)      65.94 (25.45)      61.05 (25.96)      54.62 (24.65)
FEV1/FVC                   mean (sd)   0.69 (0.14)        0.60 (0.17)        0.57 (0.18)        0.53 (0.17)
% Emphysema                mean (sd)   5.35 (8.42)        9.70 (11.61)       11.09 (12.55)      12.85 (13.61)
% Gas-Trapping             mean (sd)   18.75 (17.54)      29.22 (21.54)      32.77 (22.67)      36.84 (22.20)
Pi10                       mean (sd)   2.23 (0.57)        2.48 (0.63)        2.59 (0.58)        2.71 (0.63)
Walk Distance (ft)         mean (sd)   1431.75 (380.63)   1305.37 (357.89)   1202.55 (381.19)   1106.45 (374.18)
SGRQ Score                 mean (sd)   20.92 (20.18)      33.79 (21.55)      41.55 (21.4)       47.98 (21.79)
MMRC Dyspnea Score (≥ 2)   count (%)   1707 (30.73)       422 (55.67)        446 (69.58)        402 (75.56)
Comorbid Diseases
  Asthma                   count (%)   764 (13.74)        173 (22.73)        161 (25.08)        185 (34.71)
  Chronic Bronchitis       count (%)   834 (15)           173 (22.73)        189 (29.44)        162 (30.39)
  Diabetes                 count (%)   835 (15.03)        144 (18.92)        139 (21.68)        118 (22.18)
  GERD                     count (%)   1465 (26.36)       270 (35.48)        239 (37.23)        206 (38.72)
  CVD                      count (%)   740 (13.9)         134 (18.28)        142 (23.28)        137 (27.24)


Once again, the primary interest in evaluating the LFU data is to distinguish characteristics that predict disease trajectories in a clinically relevant way. We see in the descriptive statistics above that subjects who have more exacerbations appear to have more characteristics associated with severe disease at baseline. We looked at random samples of the above groups over the study time to gauge the variability in follow-up, the frequency of exacerbations, the amount of missing data, and whether the subject was still alive at the last point of follow-up. In general, the summary statistics suggest that there are differences between subjects who exacerbate rarely and those who exacerbate frequently.

The random samples also indicated important considerations for applying simulation studies. There is substantial variability in the length of follow-up, the intervals covered vary, and some intervals are marked as fully missing, indicating that subjects began an LFU survey but either did not complete it or answered "do not know" in response to exacerbation questions. Therefore, model structures had to accommodate the missing intervals, the differing interval lengths, and the differing lengths of follow-up for individual subjects.


Figure 4: Figures showing follow-up history of subjects in different quantiles of exacerbation rates. Top left: random sample of subjects from full-cohort. Top right: random sample of subjects in the 25th-75th quantiles of exacerbation rates, ranging from 0 to 0.01 events per month. Bottom left: random sample of subjects in the 75th-95th quantile of exacerbation rates, ranging from 0.01 to 0.06 events per month. Bottom right: random sample of subjects with the top 5% of exacerbation rates, ranging from 0.06 to 0.78 events per month.


3 Methods and Model Development

In order to effectively model the distribution of exacerbation events in COPDGene, we considered a two-component (hurdle) model. The first component distinguished between subjects with one or more exacerbations and those with none, using a hurdle model (functionally a logistic regression); within the subjects with at least one exacerbation, we applied Bayesian mixture models to identify subgroups, generally distinguishing between more or fewer exacerbations over time.

3.1 Hurdle Model

Let Y_i be the vector of event counts for individual i, i = 1, ..., N. We model the probability of zero counts, φ_0, as follows:

P(Y = y) = φ_0                            if y = 0
P(Y = y) = (1 − φ_0) P(Y = y | y ≠ 0)     if y ≠ 0        (3.1)

where log(φ_0 / (1 − φ_0)) = X_i′β_bin. The covariates X_i are those associated with the probability of no events observed over the entire follow-up for each individual, with corresponding regression coefficients β_bin. We explain further below the model for the counts at different intervals of time, P(Y = y | y ≠ 0). The hurdle model had the total exacerbation count as the outcome of interest but adjusted for follow-up time as a covariate. As a logistic model, the hurdle model included as covariates a selection of patient characteristics from the initial Phase 1 visit of COPDGene: smoking status, age, gender, race (non-Hispanic white vs. black), frequency of severe exacerbations in the year prior to the initial visit, FEV1 percent predicted, FEV1/FVC, chronic bronchitis diagnosis, percent emphysema, percent gas-trapping, airway wall thickness, dyspnea score, six-minute walk distance, and ATS pack years. Essentially, each estimated parameter is the association of a one-unit change in the covariate with the odds of having an event (logistic model).
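The two pieces of Eq. 3.1 can be sketched as follows. This is an illustrative stand-in, not the fitted model: the positive part here is a plain zero-truncated Poisson, whereas the thesis models the positive counts with the Poisson mixtures of the next section, and the function names are ours.

```python
import numpy as np
from scipy.stats import poisson

def phi0_from_covariates(x, beta_bin):
    """logit(phi0) = x' beta_bin  =>  phi0 = inverse-logit of the linear predictor."""
    eta = x @ beta_bin
    return 1.0 / (1.0 + np.exp(-eta))

def hurdle_pmf(y, phi0, lam):
    """P(Y = y) under a hurdle model: mass phi0 at zero, and a zero-truncated
    Poisson(lam), reweighted by (1 - phi0), for the positive counts."""
    if y == 0:
        return phi0
    return (1.0 - phi0) * poisson.pmf(y, lam) / (1.0 - poisson.pmf(0, lam))
```

Because the positive part is renormalized to exclude zero, the pmf sums to one by construction.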

3.2 Mixture Models

We can more specifically discuss the longitudinal structure of the mixture models as applied to COPDGene.

Let y_ij be the count of events for individual i in interval j, j = 1, ..., Q, i = 1, ..., N. Let I_j be the interval length for a given subject i at time j. Conditional on cluster membership and the random intercepts, the distribution of counts is assumed Poisson:

∏_{i=1}^{N} ∏_{j=1}^{Q} ∏_{z=1}^{Z} [ v_z · e^{−λ_ij} λ_ij^{y_ij} / y_ij! ]^{z_ik}    (3.2)

The model is parameterized by log(λ_ij | Z = z) = X_ij′β_k + log(I_j) + α_iz, with

(α_1i, α_2i)′ ∼ N( (u, u′)′, [ σ²_α1, σ_α1α2 ; σ_α1α2, σ²_α2 ] )    (3.3)

where Z = 2 and the cluster membership indicator z_ik is 1 if drawn from component z and 0 otherwise. v_z is a weighting parameter with v_z ∈ [0, 1] such that Σ_{z=1}^{Z} v_z = 1. Further, each subject i has each interval, of length I_j, included in the model estimate. σ_αz is the standard deviation of the random effect; the square root of the precision is τ, where τ_αz = 1/σ_αz. The means of the random effects α_1i and α_2i are given values u and u′, as the two are allowed to differ in some of the applied models.

Further, the probability of group membership given the weighting parameter v_z is given a Dirichlet prior probability, as referenced in Eq. 1.9 and modified below to reflect the notation for our models. ν represents K positive real numbers which influence the weighting of the random variable z in the Dirichlet distribution:

p(z|v) ∝ ∏_{i=1}^{N} ∏_{k=1}^{K} v_k^{z_ik}    and    p(v) = Dirichlet(v|ν)    (3.4)


In the case of the COPDGene data, we more specifically parameterize λ as follows, dependent on the cluster membership indicator z:

λ_ij = exp(M1·time + a + log(I_j) + α_1i),        z = 1
λ_ij = exp(M2·time + a + c + log(I_j) + α_2i),    z = 2        (3.5)

where if M1 = M2, the coefficient for time is referred to as M; if α_1i = α_2i, the random effect is referred to as α_1i; and c is sometimes 0.

3.3 Bayesian Methods and Prior Selection

While R and SAS provide some packages and procedures for mixture models, they do not permit the inclusion of unbalanced longitudinal data and may limit model extension in future work(22). Additionally, our review of relevant literature indicated that Bayesian methods may be more advantageous than finite mixture models estimated through EM algorithms(17). Therefore, we turned to JAGS (Just Another Gibbs Sampler) to develop and test our models(23).

Bayesian data analysis allows robust but very flexible estimation of parameters. It relies on the structure of Bayes' theorem, seen in Eq. 3.6, where the posterior distribution of θ, P(θ|y), is a function of the joint distribution P(y, θ) and the prior P(θ), conditioned on the known data distribution P(y). The posterior distribution essentially indicates which values of a parameter maximize the probability of observing the data conditional on the prior beliefs about the parameter values(24). In cases like mixture models, the existence of unknown parameter values and the complexity of heterogeneous data make it difficult to calculate the posterior distributions of the parameters analytically.

P(θ|y) = P(y, θ) / P(y) = P(y|θ) P(θ) / P(y)        (3.6)

Markov chain Monte Carlo (MCMC) is a Bayesian method of simulating the posterior distribution. The Markov component refers to sequential simulation such that the next draw is solely dependent on the current state, not on historic states. Simulating the posterior distributions of the parameters of interest through MCMC algorithms allows us to estimate a moderately complex model and its parameters using sequential simulation. JAGS, the program used in this project, uses a particular MCMC algorithm, the Gibbs sampler, which applies alternating conditional sampling: subsets of the parameter vector are drawn, and the iterations of the sampler draw values conditional on the current values of the other subsets. In mixture models, the Gibbs sampler alternates between two steps: sampling from the distribution of latent group indicators given the model parameters, and sampling from the model parameters given the indicators(24). These steps are fleshed out in the specification of our simulations.

The JAGS model is specified in the following manner. At each visit in the total number of visits for all subjects, the outcome is modeled as Poisson, where there are two possibilities for the value of λ depending on the mixture component to which the subject belongs. λ is defined differently depending on the model structure but contains some structure to prevent label switching and a random intercept to account for the repeated measures within subjects(25). The JAGS model further estimates the latent group indicator (z) and the random intercept for each subject. In models with no covariates, the prior probability of group membership is defined through a Dirichlet distribution, which is the conjugate prior of the multinomial distribution.
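The two-step alternation can be illustrated with a toy Gibbs sampler for a two-component Poisson mixture without covariates or random intercepts. This is a simplified sketch of the mechanism JAGS automates, not our fitted model; the function name, prior values, and order constraint are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_poisson_mixture(y, n_iter=2000, a0=0.1, b0=0.1, alpha=(1.0, 1.0)):
    """Toy Gibbs sampler for a 2-component Poisson mixture with Gamma(a0, b0)
    priors on the rates and a Dirichlet(alpha) prior on the weights.
    Alternates: (1) draw indicators z given parameters, (2) draw parameters
    given z. Sorting the rates each iteration is a crude identifiability
    constraint against label switching."""
    n = len(y)
    lam = np.array([y.mean() * 0.5 + 0.1, y.mean() * 1.5 + 0.2])
    w = np.array([0.5, 0.5])
    draws = []
    for _ in range(n_iter):
        # Step 1: sample component indicators given the parameters
        logp = np.log(w) - lam + np.outer(y, np.log(lam))
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = (rng.random(n) < p[:, 1]).astype(int)
        # Step 2: sample parameters given the indicators (conjugate updates)
        for k in (0, 1):
            idx = z == k
            lam[k] = rng.gamma(a0 + y[idx].sum(), 1.0 / (b0 + idx.sum()))
        lam.sort()  # order constraint: component 1 keeps the smaller rate
        w = rng.dirichlet([alpha[0] + (z == 0).sum(), alpha[1] + (z == 1).sum()])
        draws.append(lam.copy())
    return np.array(draws)
```

On well-separated data the post-burn-in draws concentrate near the two component rates; our applied models handle the analogous constraint through the prior structure described below.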

A primary concern in mixture modeling is label switching, where a lack of prior information on the components of the mixture causes the prior distribution to be the same for all permutations of the parameters associated with individual mixtures, such that estimates of the means of individual components are not intuitive(25). There are a few ways to deal with the label switching issue. One is to place an identifiability constraint on a component of the parameter space such that the parameter for one group is consistently smaller than that of the other(25). As the motivating data for our work are positive and discrete, we apply an identifiability constraint in each of the simulated models such that outcomes from Component 1 are structured to have consistently lower counts than those from Component 2.


3.3.1 Priors

In general, we chose non-informative and weak priors; we wanted the data to dominate the inferences from the models. Notice that both c and τ_{1,2} are positive as a result of the use of a gamma distribution as the prior. c must be strictly positive because it represents the label switching prevention structure for Models 1, 2, and 3. τ is positive because it represents the inverse of the standard deviation, which is always positive.

a ∼ N(0, 0.0001)
c ∼ Γ(0.1, 0.1)
τ_{1,2} ∼ Γ(0.1, 0.1)
p_z[1, 2] ∼ Dirichlet(ν = (1, 1))
α_iz ∼ N(0, τ_{1,2})
M ∼ N(0, 0.01)
M1 = min(M10, M20)
M2 = max(M10, M20)
M10 ∼ N(0, 0.001)
M20 ∼ N(0, 0.001)
(3.7)


3.4 Model Convergence and Selection

Since we fit models via MCMC simulation, we wanted to assess how well each model converged, which meant evaluating the Gelman-Rubin statistic and Geweke's diagnostic. Further, we needed to compare relative model fits, for which there are several options in the context of Bayesian analysis. The ones considered here are the following:

In sequences of MCMC chains with adequate convergence, we assume that inference can be made about the parameter values based on the mean and variance of the simulated values. The Gelman-Rubin statistic essentially evaluates whether parameter inferences from each chain are adequately similar. The statistic balances the within-sequence variance and the between-sequence variance, giving an estimate of the overall variance. If the Gelman-Rubin statistic (R̂) is large, the variance estimate could be decreased, or the within-sequence variance increased, through more simulations. Generally, we want this value to be close to 1(26).
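As an illustration, R̂ can be computed from a set of chains in a few lines; this is our own sketch of the standard formula, without the split-chain refinement used by modern samplers.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for an (m, n) array of
    m chains with n draws each."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-sequence variance
    B = n * chain_means.var(ddof=1)            # between-sequence variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)
```

When all chains sample the same distribution, R̂ ≈ 1; chains stuck in different regions inflate the between-sequence variance and push R̂ above 1.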

Geweke's diagnostic is another statistic that evaluates convergence, computed as the difference between the means of the first it_A iterations and the last it_B iterations, divided by the standard error of that difference. If the ratios it_A/it_total and it_B/it_total are fixed and it_A + it_B < it_total, the distribution of Geweke's diagnostic becomes close to a standard normal as n becomes large, by the Central Limit Theorem. Geweke's diagnostic can also inform how many iterations should be discarded in the burn-in process(27)(28).
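A minimal version of the diagnostic looks like the sketch below, using Geweke's customary 10%/50% windows but a naive iid standard error in place of the spectral-density variance estimates of the original.

```python
import numpy as np

def geweke(chain, first=0.1, last=0.5):
    """Geweke-style z-score: difference of means of the first 10% and last 50%
    of a chain, divided by the (naive) standard error of that difference."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[-int(last * n):]
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se
```

For a stationary chain the z-score behaves roughly like a standard normal draw; a trending (unconverged) chain produces a large absolute value.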

The deviance information criterion (DIC) is functionally a combination of a measure of goodness of fit and the model complexity, such that the deviance (D(θ) = −2 log p(y|θ)) is combined with the posterior mean deviance and the deviance at the posterior means. Like other information criteria, better models have lower DIC values. Distinct differences between models should exceed 5-10 points(29).
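Given posterior draws of the log-likelihood and the log-likelihood evaluated at the posterior means, the DIC reduces to a small calculation; the sketch below is our own illustration of the standard formula.

```python
import numpy as np

def dic(loglik_draws, loglik_at_mean):
    """DIC = D(theta_bar) + 2 * p_D, where p_D is the posterior mean deviance
    minus the deviance evaluated at the posterior means."""
    deviance_draws = -2.0 * loglik_draws        # D(theta) per posterior draw
    mean_deviance = deviance_draws.mean()
    deviance_at_mean = -2.0 * loglik_at_mean    # D(theta_bar)
    p_d = mean_deviance - deviance_at_mean      # effective number of parameters
    return deviance_at_mean + 2.0 * p_d
```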

(45)

The Watanabe-Akaike information criterion (WAIC) is an extension of the DIC meant to better manage the complex results of Bayesian models. The log pointwise posterior predictive density is estimated first, with an additional correction for the number of parameters(30). We estimate the likelihood and then apply the WAIC function from the 'loo' package in R(32).
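From a matrix of pointwise log-likelihood values (posterior draws by observations), WAIC can be sketched as below; this mirrors the quantity the loo package computes, but it is our own minimal implementation, reported on the deviance scale (−2 × elpd).

```python
import numpy as np
from scipy.special import logsumexp

def waic(loglik):
    """WAIC from an (S, n) matrix of pointwise log-likelihoods for S posterior
    draws and n observations."""
    S, n = loglik.shape
    # log pointwise posterior predictive density
    lppd = np.sum(logsumexp(loglik, axis=0) - np.log(S))
    # effective number of parameters: pointwise posterior variance of log-lik
    p_waic = np.sum(loglik.var(axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```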

Posterior predictive checks are a further method to gauge model fit, in which the posterior distribution is used to simulate data(31); the data simulated from the fitted model are then compared to the original data.
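For a Poisson outcome, a posterior predictive check can be sketched as a tail probability for a chosen test statistic; the function name and the choice of the mean as the statistic are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def posterior_predictive_pvalue(y, lam_draws, stat=np.mean):
    """For each posterior draw of the Poisson rate, simulate a replicated data
    set and compare a test statistic to the observed value. Tail probabilities
    near 0 or 1 signal poor fit."""
    t_obs = stat(y)
    t_rep = np.array([stat(rng.poisson(lam, size=len(y))) for lam in lam_draws])
    return (t_rep >= t_obs).mean()
```

A badly misspecified rate drives the tail probability toward 0 or 1, while a well-fitting model leaves it away from the extremes.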

3.5 Simulations

In simulation studies, we considered five base structures for simulation, each differing in the expression of group identification. Each allowed us to test slightly different scenarios and to understand which structure the COPDGene data may best fit.

• Simulation 1 distinguished group membership through two different intercepts separated by a constant term centered at another constant.
  – Fixed intercepts: a − 0.5c ≠ a + 0.5c
  – Random intercepts: α_i1 = α_i2
  – Slope: M1 = M2

• Simulation 2 also distinguished group membership with two intercepts but did not attempt to center based on a constant intercept.
  – Fixed intercepts: a ≠ a + c
  – Random intercepts: α_i1 = α_i2
  – Slope: M1 = M2

• Simulation 3 extended the structure of Simulation 1 but allowed for two random intercepts distributed dependent on group membership.
  – Fixed intercepts: a − 0.5c ≠ a + 0.5c
  – Random intercepts: α_i1 ≠ α_i2
  – Slope: M1 = M2

• Simulation 4 distinguished group membership by varying the slope of time dependent on group.
  – Fixed intercepts: a1 = a2
  – Random intercepts: α_i1 = α_i2
  – Slope: M1 ≠ M2

• Simulation 5 combined the random intercepts distributed dependent on group membership with differing slopes related to time.
  – Fixed intercepts: a1 = a2
  – Random intercepts: α_i1 ≠ α_i2
  – Slope: M1 ≠ M2

Each simulation was evaluated for goodness of fit and for responsiveness when there were no groups in the data - the "null" case. Early simulations did not have varying follow-up times, but the final models fit permitted both varying follow-up times and differing intervals. In order to approximate trajectories similar to those in the COPDGene dataset, we took a random sample of lengths of follow-up time and numbers of longitudinal follow-up contacts for 400 subjects from our motivating data. These values formed the time and counting variables for the simulations themselves, so the 400 subjects in the simulation studies had realistic follow-up time. The primary limitation of the simulations was that the interval time within each subject was constant and no subjects had missing intervals, which are relatively common in the motivating data.

4 Simulations

There were two possible groups for membership, with an 80% probability of being in one and 20% of being in the other. The number of events was modeled as Poisson with no truncation and a log link function in the simplest case. We allowed the follow-up time to vary in both the period followed and the number of time points. Further, we evaluated what occurred when the simulated data had the two groups suggested by the models and when there were no groups in the data, the null case.

In terms of how well these models performed, we can compare how well the models estimated the model parameters in two cases: (1) with two groups and (2) with no subgroups. For each simulation, we present the results in a table and in figures. All simulations had a burn-in of 3000 iterations, 2 chains, and 7000 total iterations. 200 datasets of 400 subjects were simulated and fit with each model, both in the "true" case of subgroups and in the "null" case of no subgroups.

The simulations differed from the actual dataset in that there were no missing intervals and all subjects had the same starting date, with no offsets for beginning the study. Further, subjects had consistently spaced follow-up intervals, unlike the actual study, where interval lengths vary substantially.

4.1 Model 1

In this scenario, time is in months and dependent on the individual, while a and c represent constant intercepts used to prevent label switching. The two groups are separated by a constant distance of c centered around a. In this case, group one represents the group with fewer outcome counts. z is the group membership indicator and α_i is a random intercept to allow for the repeated events over time.

Recall Eq. 3.5, where the link function for the mixture models was defined, and consider the case where λ is given the following structure:

λ = exp(M·time + a − 0.5c + α_i),    z = 1
λ = exp(M·time + a + 0.5c + α_i),    z = 2        (4.1)

Simulation 1 performed moderately well. Figure 5 shows the accuracy of the estimates of each simulation in the two cases. We see that, generally, the two-group simulation estimated the constant intercepts accurately. What is interesting is that c is estimated as 0 in models using the null-group data, while a incorporates the difference of c. This indicates that, in the case of no groups, Simulation 1 would indicate that no groups were identifiable. On average, Simulation 1 took 1076 seconds to fit the model and estimate the parameters.

Generally, Simulation 1 fits a situation where one group has more overall events than the other such that subjects experience consistently more or consistently fewer exacerbations.

Table 4 indicates that the model fits fairly well across the 200 data sets. The random-effect standard deviation is overestimated (0.11 vs. 0.01), but the probability of group membership is relatively close to the true value. The overestimated standard deviation suggests that this model estimates more extreme values, such that the between-subject differences are more substantial while the mean stays the same.


Figure 5: Graphs showing the density functions from the 200 data sets run in Simulation 1. Top left: "a"; top right: "c"; bottom left: "sigma"; bottom right: "m". Red lines indicate the true values.


Parameter   Median    Mean     Standard Deviation   True Value
a           -1.515    -1.520    0.096               -1.5
c            3.436     3.439    0.130                3.5
pi.1         0.204     0.204    0.028                0.20
pi.2         0.796     0.796    0.028                0.80
m            0.003     0.003    0.001                0.003
tau.1       12.583    34.156   51.924              100
sigma.1      0.079     0.111    0.097                0.01

Table 4: Summary 1: Estimated Values vs. "True Values"

4.2 Model 2

This is a reparameterization of Model 1 (Eq. 4.1). Recall Eq. 3.5, where the link function for the mixture models was defined, and consider the case where λ is given the following structure:

\lambda = \begin{cases} \exp(M \cdot \mathrm{time} + a + \alpha_i), & z = 1 \\ \exp(M \cdot \mathrm{time} + a + c + \alpha_i), & z = 2 \end{cases} \qquad (4.2)
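The equivalence of the two parameterizations can be checked numerically: taking the Model 2 intercept as a − 0.5c from Model 1, with the same c, yields identical rates for both groups. The helper names and parameter values below are illustrative.

```python
import numpy as np

def lam_m1(time, z, alpha, m=0.003, a=-1.5, c=2.0):
    # Model 1 (Eq. 4.1): group offsets of -0.5c and +0.5c around a
    offset = -0.5 * c if z == 1 else 0.5 * c
    return np.exp(m * time + a + offset + alpha)

def lam_m2(time, z, alpha, m=0.003, a=-2.5, c=2.0):
    # Model 2 (Eq. 4.2): group 1 at a, group 2 at a + c,
    # where this a equals the Model 1 value of a - 0.5c
    offset = 0.0 if z == 1 else c
    return np.exp(m * time + a + offset + alpha)

# lam_m1(t, z, alpha) == lam_m2(t, z, alpha) for both groups
```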

This again fits a situation where one group has more overall events than the other, such that subjects experience consistently more or consistently fewer exacerbations. Simulation results suggest that the parameters of Model 2 tend to be more easily estimated by the MCMC chains.

Generally, simulations of data for Model 2 were well estimated, and arguably better estimated than under the parameterization in the Model 1 simulations. This is especially true for the estimate of sigma, the random-effect standard deviation (0.024 estimated vs. 0.01 true), suggesting that this model did not estimate as many extreme between-subject differences.


Parameter   Median    Mean     Standard Deviation   True Value
a           -1.514    -1.515    0.064               -1.5
c            2.003     2.003    0.065                2
pi.1         0.202     0.202    0.029                0.20
pi.2         0.798     0.798    0.029                0.80
m            0.003     0.003    0.001                0.003
tau         44.019    46.456   15.776              100
sigma        0.023     0.024    0.008                0.01


Figure 6: Graphs showing the density functions from the 200 data sets run in Simulation 2. Top left: "a"; top right: "c"; bottom left: "sigma"; bottom right: "m". Red lines indicate the true values.


4.3 Model 3

Model 3 has the same fixed-effect structure as Model 1 but extends it by giving each group its own random intercept (αi or bi), distributed with different means but the same variance. Generally, Simulation 3 can be thought of as a case where the groups differ in the number of events, but one group may also vary more over time than the other, as illustrated by the two random intercepts.

Recall Eq. 3.5 where the link function for the mixture models was defined and consider the case where λ is given the following structure:

\lambda = \begin{cases} \exp(M \cdot \mathrm{time} + a - 0.5c + \alpha_i), & z = 1 \\ \exp(M \cdot \mathrm{time} + a + 0.5c + b_i), & z = 2 \end{cases} \qquad (4.3)

where \alpha_i \sim N(a - 0.5c, \sigma_\alpha^2) and b_i \sim N(a + 0.5c, \sigma_b^2)
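A simulation sketch under one plausible reading of Eq. 4.3, in which each subject's group-specific random intercept, centered at the group mean, serves as the subject's intercept in the rate. The visit schedule and function name are hypothetical; parameter values follow the Simulation 3 truth settings.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_model3(n=400, n_visits=6, pi=(0.8, 0.2),
                    a=-1.5, c=2.0, m=0.003, sigma=1.0):
    """Each subject's intercept is drawn around its group mean,
    a - 0.5c (group 1) or a + 0.5c (group 2), so the groups differ
    in level and, through sigma, in subject-to-subject spread."""
    z = rng.choice([1, 2], size=n, p=pi)
    mu = np.where(z == 1, a - 0.5 * c, a + 0.5 * c)
    intercept = rng.normal(mu, sigma)                  # alpha_i or b_i
    months = np.arange(1, n_visits + 1) * 6.0          # assumed schedule
    lam = np.exp(m * months[None, :] + intercept[:, None])
    return z, rng.poisson(lam)
```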

Model 3 is very similar to Model 1 but adds the second random intercept. Results from Simulation 1 indicated that c was underestimated and a slightly overestimated, with sigma substantially larger than expected (0.11 vs. 0.01). In contrast, Simulation 3 underestimated the absolute value of a (-1.3 vs. -1.5), while c was relatively well estimated. The standard deviations of both random intercepts were underestimated (0.39 and 0.22 vs. 1). The estimates of the group membership probabilities are relatively close to the true values.


Parameter   Median    Mean     Standard Deviation   True Value
a           -1.291    -1.300    0.126               -1.5
c            2.010     2.010    0.222                2
pi.1         0.802     0.798    0.051                0.80
pi.2         0.198     0.202    0.051                0.20
m            0.003     0.003    0.002                0.003
tau.1        4.233     6.561    6.829                1
sigma.1      0.236     0.392    0.527                1
tau.2        6.876     8.419    6.001                1
sigma.2      0.145     0.224    0.589                1


Figure 7: Graphs showing the density functions from the 200 data sets run in Simulation 3. Top left: "a"; top right: "c"; bottom left: "sigma.1"; bottom right: "sigma.2". Red lines indicate the true values.


4.4 Model 4

Model 4 allows the two groups to have different linear trajectories over time. To manage label switching at each point of estimation, the prior distributions of M1 and M2 depend on two normal distributions such that M1 is taken as the minimum of the two sampled parameters and M2 as the maximum.

Model 4 can be considered a situation in which one group of subjects experiences an increasing number of exacerbations over time. Neither group varies more than the other; rather, the groups are generally split by the quantity of exacerbations, particularly in relation to the time exposed to the disease.

Recall Eq. 3.5 where the link function for the mixture models was defined and consider the case where λ is given the following structure:

\lambda = \begin{cases} \exp(M_1 \cdot \mathrm{time} + a + \alpha_i), & z = 1 \\ \exp(M_2 \cdot \mathrm{time} + a + \alpha_i), & z = 2 \end{cases} \qquad (4.4)

Model 4 was somewhat more difficult to simulate. To obtain reasonable values while also limiting label switching, the values of a had to be small and close to 0, and the slopes could not deviate much from 0. Even slightly larger slopes immediately led to event counts over 100, which would have no clinical plausibility. Figure 10 provides a reasonable illustration of the estimation of the two parameters of interest, M1 and M2, in Simulation 4.
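The min/max ordering device used to prevent label switching in Model 4 can be sketched as follows; the prior means and standard deviation below are illustrative, not the values used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ordered_slopes(mu=(0.001, 0.005), sd=0.001):
    """Sketch of Model 4's ordering constraint: two slope values are
    drawn from their normal priors, then M1 is taken as the minimum and
    M2 as the maximum, so the group labels cannot switch between
    iterations of the sampler."""
    draws = rng.normal(mu, sd)
    return float(draws.min()), float(draws.max())
```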
