Cox regression

In document Diet and risk of acute pancreatitis (Page 45-50)

3 Material and methods

3.6 Statistical analyses

3.6.1 Cox regression

Cox regression, the analytical method chosen for this thesis, has been widely used for the analysis of time-to-event data for several decades (Cox, 1972).41 It estimates hazard ratios (HRs), which can roughly be interpreted as the multiplicative extent to which an exposure increases or decreases an incidence rate (Hernán, 2010). In the following subsections, I discuss two important and general aspects of Cox regression (ie, the time scale and the proportional hazards assumption), followed by thesis-specific aspects (ie, modeling of exposures and covariates, sensitivity analyses, and subgroup analyses).

Also, for those readers who are more statistically oriented, the formal equation of the Cox regression model is given below for (i) the hazard function (ℎ[𝑡|𝐱𝑗]) for one subject (denoted “𝑗”) as a function of exposure 𝐱𝑗 and (ii) the hazard ratio (𝐻𝑅) comparing that subject with another subject (denoted “𝑚”).

39The terms event, outcome, and disease are equivalent and are, therefore, used interchangeably in this thesis.

40More formally, the hazard function is defined as the “probability that the event occurs in [an infinitesimally small] time interval, given that the subject has survived to the beginning of the interval, divided by the time interval” (Bellavia, 2015).

41Other analytical methods are accelerated failure time models and Royston-Parmar models.

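(i) $$h(t \mid \mathbf{x}_j) = h_0(t)\,\exp(\mathbf{x}_j \boldsymbol{\beta})$$

(ii) $$HR = \frac{h(t \mid \mathbf{x}_j)}{h(t \mid \mathbf{x}_m)} = \exp\big((\mathbf{x}_j - \mathbf{x}_m)\,\boldsymbol{\beta}\big)$$

Here, $h_0(t)$ denotes the baseline hazard and $\boldsymbol{\beta}$ the vector of regression coefficients (standard notation for the Cox model).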
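As a minimal numerical sketch of the Cox model's hazard function and hazard ratio (not taken from the thesis), the following Python snippet evaluates both quantities for two subjects. The function names, the coefficient value, and the baseline hazards are hypothetical and serve only to illustrate that the baseline hazard cancels out of the ratio.

```python
import math
from typing import Sequence


def hazard(baseline_hazard: float, x: Sequence[float], beta: Sequence[float]) -> float:
    """Hazard at one time point: h(t | x) = h0(t) * exp(x . beta)."""
    linear_predictor = sum(xi * bi for xi, bi in zip(x, beta))
    return baseline_hazard * math.exp(linear_predictor)


def hazard_ratio(x_j: Sequence[float], x_m: Sequence[float], beta: Sequence[float]) -> float:
    """HR comparing subject j with subject m: exp((x_j - x_m) . beta)."""
    return math.exp(sum((a - b) * c for a, b, c in zip(x_j, x_m, beta)))


# Hypothetical coefficient: one extra unit of exposure multiplies the hazard by 0.94.
beta = [math.log(0.94)]
x_j, x_m = [3.0], [2.0]  # subject j has one more unit of exposure than subject m

# The ratio of the two hazards is the same whatever the baseline hazard is.
ratio_early = hazard(0.001, x_j, beta) / hazard(0.001, x_m, beta)
ratio_late = hazard(0.050, x_j, beta) / hazard(0.050, x_m, beta)
hr = hazard_ratio(x_j, x_m, beta)
```

Because the baseline hazard appears in both the numerator and the denominator, the HR depends only on the difference in exposure and on the coefficients.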

3.6.1.1 Time scale

Historically, the most common time scale in Cox regression models has been time-on-study, defined as the interval between the date of study entry and the date of study exit (due to disease occurrence or right-censoring) (Figure 3.5, left). However, it is becoming increasingly common to use attained age as the time scale (Cologne et al., 2012; Thiébaut & Bénichou, 2004) (Figure 3.5, right). Instead of entering the study at a fixed point in time, participants are considered to enter at their age at baseline and to exit at their age at disease occurrence or right-censoring. A major advantage of attained age as the time scale is that it directly controls for changes in the hazard function associated with age (which is a very strong risk factor for most diseases), without the need to include age as a separate covariate or to model it in a specific way.

The primary time scale used in the main analyses of Paper I, III, and IV was time-on-study, whereas it was attained age in the main analyses of Paper II and V.

Figure 3.5: Example of time-on-study (left) and attained age (right) as the time scale in a Cox regression model.

Crosses represent cases and dots represent right-censored observations. Modified from Bellavia, Discacciati, Bottai, Wolk, & Orsini (2015) with permission of Oxford University Press.
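The difference between the two time scales can be sketched as follows. This is an illustrative Python example under stated assumptions, not the code used in the thesis: the function names, the dates, and the 365.25-day year are all hypothetical choices.

```python
from datetime import date
from typing import Tuple

DAYS_PER_YEAR = 365.25  # assumed average year length


def years_between(start: date, end: date) -> float:
    """Elapsed time between two dates, in years."""
    return (end - start).days / DAYS_PER_YEAR


def time_on_study(entry: date, exit_: date) -> Tuple[float, float]:
    """(start, stop) on the time-on-study scale: the clock starts at 0 at study entry."""
    return 0.0, years_between(entry, exit_)


def attained_age(birth: date, entry: date, exit_: date) -> Tuple[float, float]:
    """(start, stop) on the attained-age scale: entry at baseline age, exit at age at event or censoring."""
    return years_between(birth, entry), years_between(birth, exit_)


# Hypothetical participant: born 1940, enrolled 1998, event or censoring in 2010.
birth, entry, exit_ = date(1940, 1, 1), date(1998, 1, 1), date(2010, 1, 1)
tos_start, tos_stop = time_on_study(entry, exit_)
age_start, age_stop = attained_age(birth, entry, exit_)
```

Both scales give the same length of follow-up (about 12 years here); what differs is where the participant sits on the time axis, and therefore with whom they are compared at each event time.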

3.6.1.2 Proportional hazards assumption

The Cox regression model is a proportional hazards model, which, simply put, means that it assumes that the hazards of the groups being compared are proportional over time or, put even more simply, that the HRs are the same at any point in time during the follow-up period (or at any value of age, if that is the time scale) (Bellavia, 2015; Discacciati, 2015). An example of proportional hazards is shown in the left panel of Figure 3.6, whereas an example of non-proportional hazards is shown in its right panel. Presenting a single, average (time-fixed) Cox regression-derived HR of 0.94 is appropriate in the left scenario, but presenting a single, time-fixed HR of 0.89 is less appropriate in the right scenario (clearly, the HRs are much lower than that at younger ages). Therefore, to interpret the data correctly, it is important to test the proportional hazards assumption in a Cox regression model and, if there is evidence of a violation, to handle it appropriately. (Technical and practical details on how to evaluate and handle non-proportional hazards in Cox regression are beyond the scope of this thesis; please see Royston & Lambert [2011] and Oskarsson [2015], both of which are based on the statistical software Stata.)

In this thesis, the proportional hazards assumption was tested by using the Schoenfeld test (Paper I, II, and IV) (Schoenfeld, 1980) and by modeling interactions42 between linear (Paper III and V) and flexible (Paper V) functions of analysis time and the exposures of interest (Discacciati, Oskarsson, & Orsini, 2015). However, there was no formal evidence of departure from the assumption of proportional hazards for the main exposure(s) in Paper I–V.
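To make the interaction approach concrete, a linear interaction between a single exposure $x$ and analysis time $t$ can be written as below. This is a generic sketch rather than the exact specification of any paper in this thesis; the symbol $\gamma$ for the interaction coefficient is introduced here for illustration only.

```latex
h(t \mid x) = h_0(t)\,\exp(\beta x + \gamma\, x\, t)
\qquad\Longrightarrow\qquad
HR(t) = \frac{h(t \mid x + 1)}{h(t \mid x)} = \exp(\beta + \gamma t)
```

Under proportional hazards, $\gamma = 0$ and the HR reduces to the time-fixed $\exp(\beta)$, so testing $\gamma = 0$ is one way of checking the assumption. A flexible function of time (eg, a spline in $t$) would replace the single term $\gamma t$ with several terms.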

Figure 3.6: Examples of proportional hazards (left) and non-proportional hazards (right). The gray line represents the fixed HR that was derived from a standard Cox regression model and the black lines represent the time-varying HRs and 95% confidence intervals that were obtained from a Cox regression model in which the HRs were allowed to vary over the time scale (attained age). The HRs vary considerably across attained age in the right scenario, thereby indicating a violation of the proportional hazards assumption. The figures were produced by using the post-estimation command stphcoxrcs in Stata (Discacciati et al., 2015). Data were obtained from Nordenvall et al. (2016).

3.6.1.3 Modeling of exposures and covariates

In Paper I, the variables for fruit consumption and vegetable consumption were modeled both continuously (using linear and spline functions) and categorically (according to quintiles43). In the multivariable model, in which fruit and vegetables were included simultaneously, the following covariates44 were adjusted for: age, sex, education, cigarette smoking, alcohol intake, BMI, history of diabetes, and energy intake (see Paper I for the modeling of each covariate). Missing data are inevitable in epidemiological studies, especially in very large cohort studies with thousands of participants, and must be accounted for in one way or another. As detailed in section 3.5, participants with missing data on the main exposure(s) in Paper I–IV were excluded. Missing data on covariates (a total of 7% in the multivariable model of Paper I) were handled using the missing-indicator method in Paper I–IV, in which an extra category (a missing indicator) is added to each covariate to represent its missing values (Knol et al., 2010).
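The missing-indicator method described above can be sketched in a few lines of Python. The example is purely illustrative: the helper names, the label "missing", and the education values are hypothetical and are not taken from the papers.

```python
from typing import List, Optional, Sequence, Tuple

MISSING_LABEL = "missing"  # hypothetical label for the extra category


def add_missing_category(values: Sequence[Optional[str]]) -> List[str]:
    """Replace missing values (None) with an explicit extra category."""
    return [MISSING_LABEL if v is None else v for v in values]


def indicator_columns(values: Sequence[str], reference: str) -> Tuple[List[str], List[List[int]]]:
    """Code a categorical covariate as 0/1 indicators, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical education data (years of schooling), with one missing value.
education = ["<=12", None, ">12", "<=12"]
coded = add_missing_category(education)
levels, rows = indicator_columns(coded, reference="<=12")
```

Participants with a missing covariate value therefore stay in the analysis and get their own coefficient, instead of being dropped as in a complete-case analysis.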

42The term interaction (also known as effect modification) refers to the situation in which an exposure-outcome association differs by levels of another factor (in this case: the time scale).

43One quintile represents one-fifth of a population; one quartile represents one-fourth of a population; and one tertile represents one-third of a population.

44I define a covariate as a secondary variable that can either distort (a confounding variable), mediate (an intermediate variable), or modify (an interaction variable) an exposure-outcome association. The concepts of confounding, mediation, and interaction will be further discussed in later subsections and chapters.

In Paper II, the variable for glycemic load was modeled continuously (using linear and spline functions) and categorically (according to quartiles) in separate exposure models. The secondary exposures, that is, glycemic index and carbohydrate intake, were modeled as categorical variables (according to quartiles).

The multivariable model included age, sex, cigarette smoking, and alcohol intake (a total of 4% of missing data), although I tried to adjust for several other covariates (see Paper II for details of covariates and their modeling).

In Paper III, the variable for total fish consumption was modeled as a continuous variable (using linear and spline functions), while those for fatty fish consumption and lean fish consumption were modeled as categorical variables (≤0.5, 0.6–2.0, >2.0 servings/week). The multivariable model was adjusted for age, sex, education, cigarette smoking, alcohol intake, BMI, use of fish oil supplements, vegetable consumption, and history of diabetes and/or hyperlipidemia (a total of 6% of missing data) (even though, once again, several other covariates were considered; see Paper III for details of covariates and their modeling). In addition, in the analyses of subtypes of fish, the variables for fatty fish and lean fish were included in the same multivariable model.

In Paper IV, the variable for coffee consumption was modeled in a continuous (using linear and spline functions) and a categorical (<2, 2, 3–4, ≥5 cups/day) fashion. Included in the multivariable model were age, sex, education, cigarette smoking, alcohol intake, and physical activity (12% of missing data in total) (see Paper IV for details of covariates and their modeling, including those considered but not included).

In Paper V, the variable for the RFS was modeled as a continuous variable (using linear and spline functions) and also as a categorical variable (according to approximate tertiles). Multivariable models were adjusted for age, sex, education, cigarette smoking, alcohol intake, BMI, physical activity, history of diabetes and/or hyperlipidemia, the non-RFS, and energy intake (all assessed at baseline) as well as for length of hospital stay (used as a proxy for disease severity) and calendar year of diagnosis. To avoid too few events per parameter in the Cox regression model, which might lead to systematic errors45 in HRs and 95% confidence intervals (CIs)46 (Vittinghoff & McCulloch, 2007), I modeled categorical covariates as binary variables and continuous covariates as linear variables (nota bene, binary variables and linear variables each count as one parameter) (see Paper V for the specific modeling of each covariate). For the same reason, and in contrast to Paper I–IV, multiple imputation by chained equations was used to handle missing data, including those on the main exposure (White, Royston, & Wood, 2011). This is a statistical technique in which missing values are replaced (imputed) by values predicted from a multivariable regression model. By creating multiple data sets, as opposed to a single or a few data sets, the variability of the imputed values can be accounted for. The overall percentage of missing data was 19%, and I created a total of 40 imputed data sets. The HRs from all data sets were combined using Rubin’s rules, which account for variation both between and within data sets.
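How estimates from several imputed data sets are combined can be sketched as follows. This is a simplified illustration of Rubin's rules, not the thesis code: the function names and input values are hypothetical, pooling is done on the log-HR scale, and the confidence interval uses a normal approximation instead of the t-based reference distribution that full implementations typically use.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(log_hrs: Sequence[float], std_errors: Sequence[float]) -> Tuple[float, float]:
    """Pool log-HRs from m imputed data sets; returns (pooled log-HR, pooled SE)."""
    m = len(log_hrs)
    pooled = mean(log_hrs)                          # average point estimate
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(log_hrs)                     # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


def hr_with_ci(pooled_log_hr: float, pooled_se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Back-transform to the HR scale with an approximate 95% CI (normal approximation)."""
    return (
        math.exp(pooled_log_hr),
        math.exp(pooled_log_hr - z * pooled_se),
        math.exp(pooled_log_hr + z * pooled_se),
    )


# Hypothetical log-HRs and standard errors from two imputed data sets.
pooled, se = pool_rubin([0.0, 0.2], [0.1, 0.1])
hr, lower, upper = hr_with_ci(pooled, se)
```

The pooled standard error is larger than the within-data-set standard error whenever the estimates differ across imputations, which is how the uncertainty due to the missing data enters the final CI.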

45Also known by the term bias. Different sources of bias, and their potential implication on the results of this thesis, will be discussed in detail in later chapters.

46A 95% CI is a range of values within which one can be 95% confident that the true population value lies.

As mentioned in the previous paragraphs, I used both linear and spline functions (more specifically, cubic spline functions [order of 3]) to model continuous variables (Orsini & Greenland, 2011). The assumption of a linear exposure-outcome association is common, intuitive, and leads to results that are easy to interpret (ie, a constant change in risk for each unit of change in the exposure), but there is no reason to believe that this assumption holds in all situations, especially if it has not been tested for. The use of cubic splines is a way to relax the linearity assumption, thereby allowing for non-linear exposure-outcome associations.47 The simplest, but perhaps not the most technically correct, analogy of cubic splines is to view them as pieces of a broken stick. Each piece is allowed to have its own, separate shape (a cubic function) and the pieces are then joined together at so-called knots. The result is a continuous curve that is smoothed at the knot boundaries (Figure 3.7). If such a curve is constrained to be linear before the first knot, and/or after the final knot, it is known as a restricted cubic spline. This constraint is to avoid instability at the tails of a covariate’s distribution. In Paper I–V, both-tail restricted cubic splines with 3 knots at fixed percentiles (10th, 50th, and 90th) of the exposure distribution were used.
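For readers who want to see how a restricted cubic spline is built, the sketch below computes the spline terms for one exposure value using a commonly used truncated-power parameterization. It is an illustration under stated assumptions, not the thesis code: the function names and knot values are hypothetical, and scaling by the squared knot range is only one convention among several.

```python
from typing import List, Sequence


def _pos_cubed(u: float) -> float:
    """Truncated cubic: u^3 if u > 0, else 0."""
    return u ** 3 if u > 0 else 0.0


def rcs_basis(x: float, knots: Sequence[float]) -> List[float]:
    """Restricted cubic spline terms for x: one linear term plus (k - 2) non-linear terms."""
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")
    last_gap = t[k - 1] - t[k - 2]
    scale = (t[k - 1] - t[0]) ** 2  # keeps the terms on a scale comparable to x
    terms = [x]
    for j in range(k - 2):
        term = (
            _pos_cubed(x - t[j])
            - _pos_cubed(x - t[k - 2]) * (t[k - 1] - t[j]) / last_gap
            + _pos_cubed(x - t[k - 1]) * (t[k - 2] - t[j]) / last_gap
        )
        terms.append(term / scale)
    return terms


# Hypothetical knots, standing in for the 10th, 50th, and 90th percentiles of an exposure.
knots = [1.0, 3.0, 6.0]
below = rcs_basis(0.5, knots)   # below the first knot: the non-linear term is zero
inside = rcs_basis(4.0, knots)  # between knots: the non-linear term contributes curvature
```

With 3 knots, each exposure contributes only two terms to the model (one linear and one non-linear), and the fitted curve is linear below the first knot and above the last knot.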

3.6.1.4 Sensitivity analyses

A number of sensitivity analyses were performed in Paper I–IV, with the aim of examining how robust the results of the main analyses were (see Paper I–IV for details). In this thesis, I have chosen to highlight the sensitivity analyses listed in Table 3.3, the results of which may or may not have already been presented in the individual studies. Similarly, of all the sensitivity analyses that were performed in Paper V, I focus only on the one that examined changes in dietary intake, cigarette smoking, and alcohol drinking following a diagnosis of non-gallstone-related acute pancreatitis. To do so, the questionnaire data from 1997 were compared with those from 2009.

3.6.1.5 Subgroup analyses

A number of subgroup analyses were also performed in Paper I–IV, with the aim of examining whether the results of the main analyses differed by levels of other variables (so-called effect modification or interaction) (see Paper I–IV for details). More specifically, the aim was to assess biological interaction (which refers to aspects and understanding of biological mechanisms) rather than statistical interaction (which refers to aspects and improvements of data fitting) (Ahlbom & Alfredsson, 2005; Rothman, 2012).

I have chosen to highlight the subgroup analyses by alcohol intake in this thesis, which will be “standardized” so that the results for each main exposure (modeled as a continuous variable) are presented according to strata of low and high alcohol intake (the latter defined as 12 g or more per day).48 Subgroup analyses were not considered meaningful in Paper V because of the low number of cases of recurrent and progressive pancreatic disease.

47Other ways are by using fractional polynomials or by categorizing a continuous variable.

Figure 3.7: Example of a continuous variable that is modeled using cubic splines. The dashed vertical lines indicate the knot placement, with the horizontal distance between the lines being equal to the width of the cubic splines.

Table 3.3: Sensitivity analyses of Paper I–IV

| Sensitivity analysis | Purpose |
| --- | --- |
| Adjusting for a joint multivariable model* | To account for differences in the choice of covariates and their modeling |
| Excluding potential intermediate factors† | To account for factors that may mediate an exposure-outcome association |
| Using attained age as time scale | To account for differences in the choice of time scale |
| Using multiple imputation for missing data | To account for how the missing data were handled |
| Excluding the first 2 years of the follow-up | To account for exposure changes due to pre-clinical or chronic illnesses |
| Applying a stricter outcome definition‡ | To account for misclassification of the outcome (due to underdetection of gallstones [Johnson & Lévy, 2010]) |

*Age, sex, education (≤12, >12 years), cigarette smoking (never smoker, past smoker with <10 or ≥10 pack-years, current smoker with <20 or ≥20 pack-years), alcohol intake (sex-specific quartiles of g/day), body mass index (<25, 25–29, ≥30 kg/m2), physical activity (<20, 20–40, >40 min/day), history of diabetes (yes, no) and hyperlipidemia (yes, no), energy intake (sex-specific quartiles of kcal/day), fruit consumption (quintiles of servings/day), vegetable consumption (quintiles of servings/day), glycemic load (quartiles of score/day), total fish consumption (<1.0, 1.0–1.9, 2.0–3.0, >3.0 servings/week), and coffee consumption (<2, 2, 3–4, ≥5 cups/day).

†Fruit and vegetable consumption (Appleton et al., 2016; Carter et al., 2010; Yuan, Lee, Shin, Stampfer, & Cho, 2015), high-glycemic load diets (Bhupathiraju et al., 2014; Levitan et al., 2008; Murakami, McCaffrey, & Livingstone, 2013), fish consumption (Eslick, Howe, Smith, Priest, & Bensoussan, 2009; Gunnarsdottir et al., 2008), and coffee consumption (O’Keefe et al., 2013; Rebello & van Dam, 2013) might affect the occurrence of obesity/adiposity, diabetes, and hypertriglyceridemia, that is, potential risk factors for acute pancreatitis (as summarized in Table 1.1).

‡No history of cholelithiasis and/or gallbladder and bile duct surgeries within 3 years after the index episode (or for as long as there were post-diagnosis follow-up data if the follow-up was less than 3 years) of non-gallstone-related acute pancreatitis.
