Birthweight-specific neonatal health With application on data from a tertiary
hospital in Tanzania
by Elisabeth Dahlqwist
Independent Thesis Advanced Level Department of Statistics
Uppsala University
Supervisor: Inger Persson
2014
Abstract
The following study analyzes birthweight-specific neonatal health using a combina- tion of a mixture model and logistic regression: the extended Parametric Mixture of Logistic Regression. The data are collected from the Obstetric database at Muhimbili National Hospital in Dar es Salaam, Tanzania and the years 2009 -2013 are used in the analysis. Due to rounding in the birthweight data a novel method to adjust for rounding when estimating a mixture model is applied. The influence of rounding on the estimates is then investigated.
A three-component model is selected. The variables used in the analysis of neonatal health are early neonatal mortality, if the mother has HIV, anaemia, is a private patient and if the neonate is born after 36 completed weeks of gestation. It can be concluded that the mortality rates are high especially for low birthweights (2000 or less) in the estimated first and second components. However, due to wide confidence bounds it is hard to draw conclusions from the data.
Keywords: Birthweight, heaping, neonatal mortality, Gaussian mixture model,
logistic regression
Contents
1 Introduction 1
1.1 Birthweight and neonatal health . . . . 1 1.1.1 Parametric methods to analyze birthweight-specific neonatal health . 2 1.1.2 The concepts of rounding, heaping and digit preference . . . . 3 1.2 Aim and research questions . . . . 5
2 Analyzing birthweight-specific neonatal health 6
2.1 The finite Gaussian mixture model . . . . 6 2.2 A semi-parametric method . . . . 7 2.3 A parametric method to analyze birthweight-specific neonatal mortality . . . 8 2.4 An extension of the Parametric Mixture of Logistic Regressions (PMLR) . . 11 2.4.1 Order of the Gaussian mixture model . . . . 12 2.4.2 Confidence intervals . . . . 14 2.5 The Gaussian mixture model for rounded data . . . . 18
3 Data 20
3.1 Variables . . . . 23
4 Results 25
4.1 Comparison of adjusted and non-adjusted estimates and model selection . . 25 4.1.1 Adjusting for heaping: FLIC to select the number of components . . 25 4.1.2 Adjusting for heaping: Parameter estimates . . . . 26 4.1.3 Adjusting for heaping: mortality curves and risk estimates . . . . 27 4.2 Description of the final model using Mixture models and Parametric Mixture
of Logistic Regressions, PMLR . . . . 29 4.2.1 Summary: Early neonatal mortality rates . . . . 32 4.2.2 Descriptive analysis of birthweight-specific patterns in each component 34 4.2.3 Summary: Component specific characteristics . . . . 38
5 Discussion 40
5.1 Impact of heaping . . . . 40 5.2 Epidemiological findings . . . . 41
6 Suggestions for further studies 43
7 Acknowledgements 45
8 References 46
9 Appendix 51
9.1 Appendix 1: A literature review on analysis of rounding and heaping . . . . 51 9.2 Appendix 2: Expectation-Maximization algorithm . . . . 52 9.3 Appendix 3: Additional results . . . . 56 9.4 Appendix 4: Completeness and accuracy of variables related to the analysis
of birthweight and neonatal mortality in the Obstetric database at Muhimbili
National Hospital 1999-2013. . . . 58
1 Introduction
This study was partly conducted at Muhimbili National Hospital, a tertiary hospital in Dar es Salaam, Tanzania. My fundamental idea with the study was to understand the conditions under which data to the Obstetric database at Muhimbili National Hospital (MNH) are collected. Information from my observations as well as descriptive statistical analysis of data from the database are presented in a comprehensive quality evaluation (Appendix 4). This evaluation focuses on the variables that are included in this study. That is, variables related to the analysis of birthweight and neonatal health.
1.1 Birthweight and neonatal health
Maternal and child health are important issues concerning all societies. In 2000 the reduction of child mortality and improvement of maternal health in developing countries was chosen as two of the eight Millennium Goals set up by the United Nations (UN, 2014). It is evident that in order to better understand and follow the development in the care of children and mothers the existence of high quality data as well as suitable statistical methods to analyze data is of significant importance.
Birthweight is an important variable in this context. The World Health Organization (WHO)
has classified a birthweight lower than 2500 grams as Low birthweight (LBW) and used this
classifier as an indicator of poor neonatal health ( WHO, 2014; Bian et al., 2012). However,
birthweight is a continuous variable and dichotomizing it by using a cut-off at 2500 gram
might give misleading results. An example of how the LBW classifier can be misleading
is that two neonates with the same birthweight can have different mortality risks due to
confounding factors that are influencing neonatal mortality. Moreover, it has been argued
that birthweight is not even on the causal pathway to neonatal mortality (Basso et al.,
2006; Wilcox, 2001). These two problems with using LBW as an indicator of neonatal
health has created a debate about the usefulness of the LBW classifier as well as a search
for alternative methods. The methods developed by Wilcox et al. and Gage et al. use
the birthweight distribution with its fat left tail. Previous research has shown that one Gaussian distribution and two residual distributions with unknown parametric distribution (Umbach and Wilcox, 1996) or two Gaussian distributions (Gage, 2009) have given a good fit. These sub-distributions that are fitted in a mixture model are usually denoted as ’sub- populations’ or more generally, components. The word ’components’ will be used in this study.
Moreover, the interesting question is if these components carry some epidemiological meaning.
It has been shown that the mortality curves for the components in the mixture distributions show different shapes indicating heterogeneity in the birth-cohort (Gage, 2009). One in- terpretation is that the component showing the overall highest risk for neonatal mortality could be regarded as the population at risk. If this is the case, mixture models based on birthweight could be used to control for unobserved heterogeneity in the birth-cohort and thus enable further statistical analysis assuming homogeneity in each component. Moreover, a framework for analyzing other characteristics than mortality of the components has been developed (Deng et al., 2011). The following sections will give some introduction to how birthweight-specific neonatal mortality can be modeled as well as an introduction to the concept of heaping and rounding.
1.1.1 Parametric methods to analyze birthweight-specific neonatal health
The finite Gaussian mixture model has initially been used as a tool to identify normal and
compromised subpopulations using birthweight (Umbach and Wilcox, 1996) and extended
to discover different patterns of neonatal mortality or risk (Charnigo et al., 2010a; Deng et
al., 2011; Gage, 2009; Gage and Therriault, 1998). The two main approaches can be divided
thusly; first, the semi-parametric approach formulated by Wilcox and Umbach and second,
the parametric approach called ’Parametric Mixture of Logistic regressions’ (PMLR) formu-
lated by Gage and colleagues (Gage, 2009; Gage and Therriault, 1998). In the parametric
PMLR approach the mixtures are used to control for unobserved heterogeneity when using
logistic regression with probabilistically assigned component memberships within each com-
ponent (Gage, 2009). The PMLR has later been extended by Deng, Charnigo and colleagues
and their extension will be explained and applied in this study (Charnigo et al., 2010a;
Charnigo et al., 2010b; Deng et al., 2011).
One potential problem when using the parametric approach is that it theoretically demands exact data on birthweight. As birthweight is measured in grams it is assumed to be continuous and since the gram is an exact unit of measurement, birthweights can be regarded as measured on a ratio scale. In the ideal setting the digital weighing machine generates birthweight data that are immediately reported to a computerized database. This procedure would generate perfectly recorded birthweights which fulfills the theoretical assumptions made about the data. If this is the case the use of parametric methods can be justified. Unfortunately, previous studies have discovered rounding or heaping in the birthweight variable in different contexts (Dellaportas et al., 1996; Edouard and Senthilselvan, 1997; Emmerson and Roberts, 2013), see Section 1.1.2. Rounding and heaping violates the assumption of continuous data.
Moreover, the phenomenon of rounding or heaping might be more present in contexts lacking proper weighing instruments as well as awareness of how rounding can constitute a statistical problem. In a comprehensive quality evaluation made by the author of the Obstetric database at Muhimbili National Hospital (MNH) in Tanzania an extensive problem with heaping in the birthweight variable is described (Appendix 4).
Leaving some of the epidemiologically interesting aspects behind, the aim with this study is to apply the extended Parametric Mixture of Logistic Regression approach on obstetric data from Muhimbili National Hospital and adjust for heaping. Moreover, even if this study does not focus on the epidemiological aspects some notes on the medical findings will be given.
1.1.2 The concepts of rounding, heaping and digit preference
The concept of rounding, heaping or digit preference has for a long time been recognized in the statistical and medical literature (see e.g. Bar and Lillard, 2012; Edouard and Senthilselvan, 1997; Emmerson and Roberts, 2013; Hasselstr¨om et al., 2013; Heitjan and Rubin, 1991;
Heitjan, 1989). Moreover, there are a few related concepts that to some extent imply the
same basic phenomenon. These are coarse data, rounding,heaping and digit preference.
’Rounding’ can be considered as an elementary form of coarse data where values are rounded approximately to the same extent (Heitjan and Rubin, 1991). ’Heaping’ denotes the phe- nomenon when values are rounded with different degrees of exactness (Wang, H.,2009) while
’Digit preference’ are non-uniform rounding for all values. Instead the rounding depends on how close the value is to a ’preferred’ value (Camarda, C.G. et al., 2008).
A simulated example of heaping is shown in Figure 1. The data is simulated from a mix- ture of two normal distributions and represent a possible birthweight distribution with data measured in kilos. All values are rounded to one decimal, representing rounding, and the chimneys represent preferred values. This mix of two different types of rounding represent heaping.
Figure 1: A simulated example of rounding and heaping in the birthweight variable using a two component Gaussian mixture model, n= 100 000
The implications of heaping are discussed in a range of areas, see Appendix 1, and this study will briefly discuss its implications for the analysis of birthweight-specific neonatal health.
Hence, this study will merge two fields that are somewhat overlapping, one that focuses
on the analysis of birthweight in relation to neonatal health and one that aims to correct
for heaping in the birthweight variable. With this background, the aim and the research questions are now given.
1.2 Aim and research questions
The main aim with this study is to apply PMLR framework for analyzing birthweight-specific neonatal health extended by Charnigo and colleagues to the data from the Obstetric database at MNH. From the quality evaluation it is evident that there is a problem with heaping.
Thus, the framework will also be adjusted for heaping by the method developed by Zhao and Bai (2012), see Section 2.5, in order to evaluate the influence of heaping on the parameter estimates.
Hence, the first part is to investigate birthweight-specific neonatal mortality and prevalence of risk factor such as HIV, private patient and preterm in the different components. The second part is to use the adjusted estimates to evaluate the influence of heaping. These two parts are connected as the second part gives some implications on the accuracy of the estimates in the first part. The research questions of this study are:
• Firstly, how can the components in a finite Gaussian mixture model be characterized by risk factors using data from the Obstetric database at Muhimbili Nationa Hospital, Dar es Salaam, Tanzania?
• Secondly, what are the implications on the parameter estimates in a mixture model for analyzing birthweight when not adjusting for heaping in the birthweight variable?
Moreover, the general goal is to give an empirical example of pitfalls and possibilities when
using data in a low-income setting.
2 Analyzing birthweight-specific neonatal health
This section presents some mathematical background and practical information needed in order to understand how to fit finite Gaussian mixture models and how to use the Parametric Mixture of Logistic Regressions to analyze birthweight-specific neonatal health.
2.1 The finite Gaussian mixture model
The finite-dimensional mixture model was originally developed by Karl Pearson as a method to model asymmetrical or skewed data (McLachlan and Peel, 2000). Pearson applied a two- component Gaussian mixture model as a tool to use the forehead to body length of crabs to distinguish between two subspecies within the overall population of crabs. One attractive feature of the mixture model is that it is semi-parametric as not one single parametric model is assumed for the whole population. Instead, the population is modeled as a mix of sub- populations of some specified distributions (McLachlan and Peel, 2000). Component mem- bership (such as species) is not always known and in these cases the iterative Expectation- Maximization (EM) algorithm is used to probabilistically assign component membership to each observation and achieve maximum likelihood (ML) estimates of the mixture parameters (McLachlan and Peel, 2000; McLachlan and Thriyambakam, 2000).
In Equation 1 the definition of the Gaussian mixture model with g components is presented.
f (y
j) =
!
g i=1p
i× N(y
j| µ
i, σ
i) where N (y
j| µ
i, σ
i) = f (y
i| µ
i, σ
i) = 1
"
2πσ
2ie
−(yj−µi)2 2σ2i
(1) where g denotes the number of components such that i = 1, . . . , g and n be the number of observations, j = 1, . . . , n. Moreover, let p
idenote the mixing proportion which is always non-negative and smaller than unity so that the sum of all mixing proportions sum to unity.
Thus, the conditions in Equations 2 and 3 should be fulfilled (McLachlan , 2000).
0 ≤ p
i≤ 1 (i = 1, . . . , g) (2)
and
!
g i=1p
i= 1 (3)
The definition of the mixture model, Gaussian mixture model and the general approach of estimation using the Expectation-Maximization algorithm is presented in Appendix 2 and is well described in (McLachlan and Peel, 2000; McLachlan and Thriyambakam, 2000).
The following subsections will further explain how parametric and semi-parametric versions of the Gaussian mixture model can be used when analyzing birthweight-specific neonatal health.
2.2 A semi-parametric method
A semi-parametric three component finite mixture model for measuring epidemiologically
useful characteristics of the birthweight distribution is presented by Wilcox and Umbach
(Umbach and Wilcox, 1996). This model assumes that the birthweight distribution is accu-
rately described by a main component following a normal distribution and consisting of all
regular births and two residual components following no known distribution and consisting of
high-risk births in the residual left and right tails of the main distribution. This approach is
regarded as semi-parametric since no distributional assumption is made on the two residual
distributions. The method by Wilcox and Umbach uses the birthweights compressed into
bins of 100-grams and is therefore more robust to rounding and heaping.
2.3 A parametric method to analyze birthweight-specific neonatal mortality
There are many factors that influence infant mortality, some which are observable and usually reported to Obstetric databases, such as the diagnosis of the mother, mother’s educational background and social status and complications during pregnancy. Other factors that might influence neonatal mortality such as the overall health of the mother and child, the lifestyle of the mother and nutritional intake are harder to observe and measure correctly. More- over, there might be additional factors that are latent and unobserved. In order to analyze birthweight-specific neonatal mortality by logistic regression one of the main assumptions is that the model is correctly specified and control for all factors that influence infant mortal- ity, that is , there is no neglected heterogeneity (Wooldridge, 2010, p.582). By including all factors we can regard the birth-cohort as homogeneous since the factors that induce hetero- geneity are controlled for. However, as described, there are many factors that need to be controlled for, observable as well as latent. The Parametric Mixture of Logistic Regressions has been developed by Gage (2009) with the purpose to use logistic regression within each component to control for unobserved heterogeneity (Gage, 2009).
Figure 2: Simulated birthweights, n= 5 000 (using R-code from Charnigo et.al. 2010a) (a)A two-component Gaussian mixture model, (b) A four-component Gaussian mixture model fitted
The Parametric Mixture of Logistic Regressions approach assumes that the birthweight dis- tribution can be correctly modeled by a two component normal mixture model. The two components are assumed to represent the babies undergoing normal and compromised fetal development. The first graph in Figure 2 shows a two-component Gaussian mixture model that is fitted to simulated data from a four-component mixture model of 5000 observations.
The second graph in Figure 2 depicts the simulated data modeled by a four component model.
One of the main interests when using Gaussian mixture models is to evaluate birthweight- specific neonatal mortality rates within each component. This analysis is possible to perform by a reformulated version of the logistic regression adopted for the setting when component membership is only probabilistically known (Gage, 2009). Since the component memberships are only probabilistically known it is assumed that the mortality rate of the population at the specific birthweight y
jis a weighted sum of the mortality at the two components at birthweight y
j. The standardized weights are defined in Equation 4 according to Gage (2009):
ˆ q(y
j) =
#
1ˆ p1
$ f (y
j| ˆµ
1, ˆ σ
1)
#
1ˆ p1
$ f (y
j| ˆµ
1, ˆ σ
1) + #
1ˆ p2
$ f (y
j| ˆµ
2, ˆ σ
2) (4)
The procedure contains two steps and thus, the weights in Equation 4 are defined as a function of birthweights conditioned on the Maximum-Likelihood parameter estimates of the mixture model, ˆ p
i, ˆ µ
iand ˆ σ
i, i = 1, 2, where ˆ p
iis the estimated proportion of the sample that belong to component i. As the first step is to estimate the parameters in the mixture model, the weights and the model parameters are regarded as posterior probabilities of component membership when later estimating the logistic regressions in every component.
The birthweight-specific mortality rates, denoted as h(y
j), are estimated according to the following Equation 5 (Gage, 2009):
h(y
j) =
% e
(!2l=0ˆb1l(yj))1 + e
(!2l=0ˆb1l(yj))q(y ˆ
j) + e
(!2l=0ˆb2l(yj))1 + e
(!2l=0ˆb2l(yj))[1 − ˆq(y
j)]
&
(5)
The probabilities,
e(!2l=0 bil(yj ))
1+e(!2l=0bil(yj ))
represent the component-specific probabilities of dying in the
two components. The logistic regression can be expressed as Equation 6 (Gage, 2009):
!
2 l=0ˆb
1,l(y
j) = ˆb
1,0+ ˆb
1,1y
j+ ˆb
1,2y
j2(6)
where ˆb
1l(y
j) and ˆb
2l(y
j) are the component specific estimates for the covariates. Gage ar- gue that a second degree polynomial is a good choice as it captures the U-shape of the birthweight-specific mortality for each component which has been shown in previous studies on homogeneous birth-cohorts (Gage, 2009). The estimates of the component specific param- eters in the logistic regressions are fitted by minimizing the sum of the weighted probabilities in Equation 7 using the paired observations on mortality, x
j, and birthweights, y
j, across all births (Gage, 2009).
−ln
% e
(xj!2l=0ˆb1l(yj))1 + e
(!2l=0ˆb1l(yj))q(y ˆ
j) + e
(xj!2l=0ˆb2l(yj))1 + e
(!2l=0ˆb2l(yj))[1 − ˆq(y
j)]
&
(7)
One very important aspect is that the Gaussian mixture model is heavily based on the assumption that the sub-populations are normally distributed. The Gaussian mixture model that is specified by Gage also assumes that the number of components are two. In reality, the actual number of components in the population of births is unknown. As it can be argued that the number of components can vary between populations, having a fixed number of components is problematic since it might ignore differences in the number of sub-populations between populations.
Charnigo and colleagues have extended the PMLR approach developed by Gage to a flexible number of components (Charnigo et al., 2010a). The generalized extension of the PMLR tech- nique allows for estimating the number of components using a Flexible Information Criterion (FLIC) developed by Charnigo and colleagues (Charnigo et al., 2010a). Another extension is to use the PMLR to understand how risk factors such as mothers’ diagnosis and gestational age (age of the fetus in weeks) can characterize the components in the mixture model (Deng et al., 2011).
Even though the problem with assuming that each component follows a normal distribution
is not discussed by Charnigo and colleagues their extension of the PMLR developed by
Gage is a novel and interesting approach to analyze neonatal health. Due to the problems
with unobserved heterogeneity it is interesting to apply the extended version of the model developed by Gage in order to analyze birthweight-specific neonatal health using the data from Muhimbili National University. The details of this framework will be explained next.
2.4 An extension of the Parametric Mixture of Logistic Regres- sions (PMLR)
The Parametric Mixture of Logistic regressions technique formulated by Gage and colleagues has been extended from containing 2 to g components in a finite Gaussian mixture model to by Charnigo and colleagues. The risk function or the birthweight-specific mortality for component i (1 ≤ i ≤ g) can be expressed as in Equation 8 (Charnigo et al., 2010b):
ˆ
r
i(y
j) = e
(!4l=0ˆbi,l(yj))1 + e
(!2l=0ˆbi,l(yj))(8) where y
jrepresents the birthweight for individual j and '
4l=0
ˆb
i,l(y
j) represents the logistic regression function for every birthweight (Charnigo et al., 2010b). In the two component case (Gage, 2009) it was assumed that a second-degree polynomial could describe the mor- tality curves for each component. However, Charnigo et al. argue that letting ˆb
i,l(y
j) to be a four-degree polynomial allows for up to two changes in convexity for each birthweight- specific mortality curve (Charnigo et al., 2010b). Thus, the logistic regression function to be estimated is formulated as in Equation 9 (Charnigo et al., 2010b).
!
4 l=0ˆb
i,l(y
j) = ˆb
i,0+ ˆb
i,1y
j+ ˆb
i,2y
2j+ ˆb
i,3y
j3+ ˆb
i,4y
j4(9)
where i = 1, . . . , g and j = 1, . . . , n. For all g components the population overall risk-function can be expressed as in Equation 10 by using the law of total probability:
'
gi=1
r ˆ
i(y
j)ˆ p
if (y
j; ˆ µ
i, ˆ σ
i) '
gi=1
p ˆ
if (y
j; ˆ µ
i, ˆ σ
i) (10)
The parameters ˆ r
i(y
j) in Equation 10 are estimated using the optimization (optim in R, R foundation for statistical computing, Vienna, Austria, 2009) by conditioning on the first step where the parameters in the normal-components have been estimated by the EM-algorithm and optimization. As was noted before, one of the fundamental differences in the approach developed by Charnigo et al. and Gage et al. is that the former allows for selecting the number of components that show the best fit to the data. The assessment of ’best fit’ is to use a penalized likelihood function as an information criterion. Charnigo and Pilla (Pilla and Charnigo, unpublished) have developed the Flexible Information Criterion (FLIC) that is used to select the number of components. This is presented shortly in the next section.
2.4.1 Order of the Gaussian mixture model
The procedures to select the order of a finite mixture model are many and can roughly be divided into formal tests and a criterion based on a penalized likelihood (McLachlan p.220, Li 2012). In the new extended framework that generalizes the PMLR procedure the Flexible Information Criterion (FLIC) (Equation 11) is used as the criterion for selecting the number of components (Charnigo et al., 2010a). This criterion is developed by Pilla and Charnigo (Pilla and Charnigo, unpublished) and is formulated in Equation 11:
F LIC
g= −2logL
g+ 2(log √
n)
B(n,δ)(3g − 1) (11)
where g denotes the number of components, n the sample size and the function B(n, δ) is explained in Equations 14 and 15. Two other commonly used information criteria are the Akaike Information Criterion (AIC), Equation 12, and Bayesian Information Criterion (BIC) ,Equation 13.
AIC
g= −2logL
g+ 2(3g − 1) (12)
BIC
g= −2logL
g+ logn(3g − 1) (13)
The difference between the criterias in Equations 12, 13 and 11 is the penalty term B(n, δ) in Equation 11. This term is not only determined by the sample size but also by the con- figuration of the data points so that more heterogeneity in the sample allows for selecting a model with more components. This is approximated by the penalty statistic that uses the fraction of between-component variability to total variability for the estimated g-component model as described in Equation 14 (Pilla and Charnigo, unpublished)
δ(Y ) = 1 − 1 g
!
G g=2( 1 S
2!
g i=1ˆ
p
i,g)ˆ θ
i,g− ¯ Y *
2+
(14)
where g denotes the number of components and G the model with the maximum number of components. Let ˆ θ
i,gdenote the mean response function. Moreover, ¯ Y and S
2are the sample mean and variance and ˆ p
i,gis the estimated mixing proportions for each model. The term B(n, δ) in Equation 11 is denoted as the ’bivariate ratio function’ and is defined in Equation 15:
B(n, δ) = Φ[(log √
n)
δ] − Φ(1)
1 − Φ(1) (15)
where Φ( ·) denotes the standard Gaussian cumulative distribution function. The bivariate ratio function in Equation 15 is defined for integers n > exp(2) and real δ ∈ [G
−1, 1].
Moreover, it should be noted that B(n, δ) is nonnegative and increasing in both n and δ (Pilla and Charnigo, unpublished).
Charnigo and colleagues have chosen FLIC as the main model selection criterion in their PMLR framework (Charnigo et al., 2010a). The reason for this is that a simulation study comparing the three criterias; Akaike Information Criterion (AIC) (12), Bayesian Information Criterion (BIC) (13) and FLIC has shown that while BIC and FLIC performed better for large samples FLIC and AIC performed better for small samples (Charnigo et al., 2010a).
Thus, in comparison with AIC and BIC FLIC performs better.
It should be kept in mind that FLIC depends on the sample size in combination with the
number of true components (Charnigo et al., 2010a). A possible solution to this problem is
to resample from the original sample with a fixed sample size of 50 000 observations so that the result from the FLIC can be comparable with other populations (Charnigo et al., 2010a).
This is employed in the further generalized approach by Deng and colleagues (Deng, et al., 2011).
We continue with the next section which will give a brief explanation of how to access the uncertainty in the mixture parameter estimates as well as how to construct confidence intervals and compare components.
2.4.2 Confidence intervals
The main drawbacks of the EM algorithm is that it does not produce an analytical variance and covariance matrix of the estimates such as the Fisher’s scoring algorithm and similar methods do (McLachlan, 2000). In the framework developed by Charnigo and colleagues the problem of no analytical covariance matrices is solved by using resampling. As Charnigo and colleagues have worked with large sample cases (more than 200 000 observations) the computation of the EM-algorithm is slow when resampling from the whole dataset. Instead, resampling with replacement is made on n=50 000 observations. These resamples are then used to select the the number of components using FLIC and assess uncertainty in the pa- rameter estimates (Deng et al., 2011). Since 50 000 observations are drawn with replacement Charnigo and colleagues have formulated approximate 95 % confidence intervals that account for overlapping samples (Charnigo et al., 2010b).
The confidence intervals are estimated in the following way in the large sample case. The
uncertainty in parameter estimates is estimated by resampling a sample of n=50 000 outcome
pairs of birthweight and mortality (or any other risk factor) N
reptimes with replacement. The
mixture parameters as well as the parameters in the logistic regression are then estimated
using these meta samples, samples of a sample (Charnigo et al., 2010b). Let θ represent
any parameter to be estimated for making inference. For each one of the N
repsamples the
parameter is estimated giving the meta sample, ˆ θ
1, . . . , ˆ θ
Nrep, of which a meta sample mean
is estimated ˆ θ
meta(Equation 16), (Charnigo et al., 2010a).
θ ˆ
meta=
'
Nrepl=1
θ ˆ
lN
rep(16)
where N
reprepresents the number of resamplings. The corresponding meta sample standard deviation ˆ S
θis constructed according to Equation 17:
S ˆ
θ=
,'
Nrepl=1
(ˆ θ
l− ˆθ
meta)
2(N
rep− 1) (17)
The 95 percent confidence interval can then be constructed as in Equation 18.
θ ˆ
meta± C
0× S ˆ
θ"
N
rep(18)
In Equation 18, C
0can be chosen as the 0.025-quantile of the standard normal distribution if the meta sample, ˆ θ
1, . . . , ˆ θ
Nrep, is approximately normally distributed by the Central Limit Theorem. If the sample is small but the population is assumed to be normally distributed the 0.025-quantile of the t-distribution with (N
rep− 1) degrees of freedom can be used. The more conservative alternative is to choose C
0=
√0.051= 4.47 based on Chebychev’s inequality (Charnigo et al., 2010a). The later alternative is the most suitable one since inference for mixture parameters needs a very large sample size is needed in order for asymptotic theory to apply (Xiang et al.,2005; McLachlan , 2000).
Moreover, let C
φdenote the critical value that adjusts for overlaps in the resampling (Charnigo et al., 2010a, Additional file 1: Technical appendix). Let φ =
Nnwhere N is the original sam- ple size and n is the sample size within each resample N
rep. Then C
φcan be defined thusly (Equation 19) :
C
φ= C
0, φN
rep(1 − (1 − φ)
Nrep) (19)
However, if the number of observations within each resample is large and close to the total sample size, (n ≈ N) then C
φ≈ C
0"
N
rep.
When the total sample size is relatively small (around 50 000 observations) there is no need to draw resamples of n < N since the computing time will not differ much from taking n = N . Thus, for the small sample case the bootstrap method can be applied. The bootstrap approach is more or less equivalent to the resampling method described by Charnigo and colleagues(Charnigo et al.,2010a) but using n = N (Chernick, 1999). Using the bootstrap implies that N observations are sampled with replacement N
reptimes and yield the following confidence interval when inserted in Equation 18 (Charnigo et al., 2010a):
θ ˆ ± C
0× ˆ S
θ(20)
Another issue is that the mixture model parameters have non-negligible bias. Charnigo and colleagues have adjusted the confidence interval by adding the estimated bias ˆ B
θto the interval in Equation 18. This bias term is estimated from simulations in which the esti- mated parameters from the mixture model, ˆ p
1, ˆ µ
1, ˆ σ
1, . . . , ˆ p
g, ˆ µ
g, ˆ σ
gare used in 5 simulations.
The simulated data are then used to estimate the parameters in the previously selected g-component Gaussian mixture. The 5 simulated estimates are then compared with the es- timates from the mixture estimates from the whole dataset, ˆ θ. The following Equation 21 present the estimate of the mean bias of 5 simulations (Charnigo et al., 2010a).
B ˆ
θ= '
5k=1
| ˆθ
sim− ˆθ |
5 (21)
Where k = 1, ..., 5 and represent the number of simulations that are used to estimate the mean bias term B
θ. The confidence interval that takes the bias into account when n = N can then be formulated as in Equation 22 by using Equation 20 (Charnigo et al., 2010a).
θ ˆ ± ( ˆ B
θ+ C
0× ˆ S
θ) (22)
When making confidence bounds for the risk functions the procedure is analogous. For every
resample with the birthweight and risk factor outcome pair an overall estimate of the risk
function for component i is estimated for each of the N
represamples according to Equation 8, ˆ r
i,1(y
j), . . . , ˆ r
i,Nrep(y
j) as in Equation 23 (Charnigo et al., 2010a).
ˆ
r
i(y
j) = logit
−1- N
rep−1N
!
reps=1
logit {ˆr
i,s(y
j) } .
= exp /
N
rep−1'
Nreps=1
log (
ˆ ri,s(yj) 1−ˆri,s(yj)
+0
1 + exp /
N
rep−1'
Nreps=1
log (
ˆri,s(yj) 1−ˆri,s(yj)
+0 (23)
Where i = 1, . . . , g and g denotes the number of components. By fixing the birthweight to y = y
0we can define the parameter as θ = logit {r
i(y
0) }. For the N
represample the risk estimates are defined as logit {ˆr
i,1}, . . . , logit{ˆr
i,Nrep} and denoted as the meta sample, θ ˆ
1, . . . , ˆ θ
Nrepfor a fixed birthweight, y = y
0. From this meta sample a meta average, ˆ θ, and standard deviation, ˆ S
θ, are estimated. The confidence intervals are then constructed as in Equation 22. The bias is estimated analogously and according to Equation 21.
The number of replications, N
rep, is still a crucial question. Charnigo and colleagues use large datasets of 200 000 births or more and in that setting they recommend to draw N
rep= 25 replications by resampling with n=50 000. However, in their study a sample size of 200 000 births are denoted as small and the confidence intervals are wide for the extreme birthweights.
The dataset that is used in this study contains 40 340 observations and are only a fraction of 200 000. Thus, in this context this is a small sample situation. In order to improve the approximate 95 % confidence intervals N
repis selected to be as large as possible (limited by computing time, set to a maximum of 20 hours) which give approximately N
rep= 250.
Theoretically, the minimum number of necessary bootstraps vary from situation to situation
(Davidson et al., 2000). Charnigo and colleagues have shown by a simulation study that 25
resamples can be enough for a large original sample size (Charnigo et al., 2010a). This aspect
should be studied further in order to improve inference in small sample situations. However,
in this thesis 250 replications will be made on the whole sample of 40 340 observations, thus
N
rep= 250 and n = N is used.
2.5 The Gaussian mixture model for rounded data
One of the questions that will be explained in this study is how heaping and rounding can be adjusted for in the mixture context. In Section 1.1.2 the concepts of rounding, heaping and digit preference are clarified. A range of methods have been formulated to adjust for heaping, rounding and digit preference (see Appendix 1 for a literature overview). There are two especially interesting methods that have been formulated lately that deal with heaping, rounding and digit preference. The first one, the Composite Link Model (CLM) uses the Generalized Linear Model (GLM) framework to uncover the latent distribution of the heaped data. This is done by assuming a smooth latent distribution and modeling how the original data have been transferred to neighboring values using the Poisson distribution (Camarda et al., 2008; Eilers and Borgdorff, 2004; Thompson and Baker, 1981). The interesting aspect of this approach is the possibility to take the heaping mechanism into account in order to get a better fit to the model. However, this model gives an estimate for every possible value and for a large amount of possible values it might be computer intensive and thusly, more suitable for discret data.
The second method that has recently been developed for rounded and heaped data is for-
mulated by Bai and colleagues in a series of papers. They have developed a consistent and
asymptotically normally distributed estimate for rounded data in a time series setting (Bai et
al., 2009), for dependent data (Zhang et al., 2010) and for normal mixture models (Zhao and
Bai, 2012). Their work is founded on the work of Lee and Vardeman which has shown how to
construct asymptotically normal confidence intervals for rounded data (Lee and Vardeman,
2002). The work made by Bai and colleagues has focused on how to adjust for rounding in
existing methods. This way to adjust for heaping is a convenient alternative to compara-
tive methods (see Appendix 1). In a mixture setting Zhao and Bai (2012) have shown how
the EM-algorithm can be adjusted for rounding and heaping by defining the heaping interval
and incorporate this when estimating the mixture components. Since this method has proven
easy to incorporate when estimating mixtures and is less computer intensive compared with
the Composite Link Model (CLM) it will be used here. The method by Zhao and Bai (2012)
can be explained as follows.
Assume we want to estimate the parameters in a g-component mixture model, θ = (p
i, µ
i, σ
i), using the Expectation-Maximization algorithm when we have heaped data (the EM-algorithm is described in Appendix 2). Let the heaping interval be denoted as h, the observed and heaped values be represented by y
jfor individual j with the probability density function g(y
j| θ). Moreover, let the unobserved exact value for individual j be x
jwith the specified probability density function f (x
j| θ). The relationship between the observed and rounded y
jand the latent and exact x
jis defined in Equation 24 according to Zhao and Bai (2012).
Y
j= y
jif y
j− h
2 ≤ X
j< y
j+ h
2 (24)
where j = 1, . . . , n and n is the sample size.
In the case when y
jis rounded on the interval h the probability of observing y
jcan be expressed as in Equation 25 (Zhao and Bai, 2012).
g(y
j| θ) =
1
yj+h/2 yj−h/2f (x
j| θ) (25)
If x
jis assumed to come from a mixture of normal distributions as in Equation 26.
f (x
j| θ) = f(x
j| µ
i, σ
i) = 1
"
2πσ
i2e
−(yj−µi)2
2σ2i
(26)
where i = 1, . . . , g and g denotes the number of components in the mixture. The Log- Likelihood function for the rounded data can be expressed as Equation 27 (Zhao and Bai, 2012).
&
j(θ) =
!
n j=1log(g(y
j, θ)) (27)
By using the definition of the probability density of the rounded data in Equation 25 the
Log-Likelihood in Equation 27 can be adjusted for rounding and heaping. The definition of
the density function in Equation 25 is then used in the EM-algorithm to estimate the mixture
parameters θ. The derivation of this estimation procedure is described in detail in Zhao and Bai (2012).
The method has been shown to work well to correct for rounding in data and give consistent and unbiased estimates when applied in the EM-algorithm for the mixture model (Zhao and Bai, 2012). Moreover, the following sections will describe the data and the results.
3 Data
Many projects have been made with the aim to improve data quality when databases exist or to establish systems for data collection and management when they do not exist. One such project is the Obstetric database at Muhimbili National Hospital in Dar es Salaam, Tanzania. The Obstetric database was founded in 1998 and was a co-operation project between Uppsala University, Ume˚ a University and Muhimbili University of Health and Allied Sciences (MUHAS). The goal was to establish a research database as a first step in the development of a birth register at Muhimbili National Hospital. Over the years the database has been developed and the need for new variables, methods of measurement and reporting has changed. The Obstetric database has been collecting data for over 15 years and during the period 1999-2013 a total of 174 064 deliveries were recorded in the register.
A comprehensive quality evaluation made by the author on selected variables related to the analysis on neonatal health (Appendix 4) showed that the birthweights were heaped, rounded to different extent, and that there are problems in the assessment of gestational age. One other problem is that the survival of the neonate after admission is not followed up. Because of this the analysis is limited to ’very early neonatal mortality’ that is, if the neonate survives the first few hours or days of life. Lacking procedures to collect follow up data on the survival of the neonate implies that the mortality rates are underestimated.
The observations that have been selected are singletons that have been classified as born alive
and admitted or discharged, fresh stillbirths and early neonatal deaths. Thus, macerated
stillbirths will not be used in the analysis since the fetus has died in the womb some time
before delivery. In the years before 2004 macerated and fresh stillbirths were both reported as stillbirths. During 2004 there are still some cases in this category and because of this the years after 2004 are selected. However, the database has been under construction and the assignments of diagnosis as well as automatic controls for the data imputation has changed over the years. Due to this it is reasonable to select later years. Due to the improved quality over time the years 2009 to 2013 are selected for the analysis, yielding 40 340 complete observations for the final dataset. A comparison of the neonatal mortality rates is also made with the whole dataset. In Figure 3 the final dataset is described by a bar chart.
According to the definition given in Section 1.1.2 heaping denotes a mixture of values that are heaped to a varying extent. In Figure 3 it is evident that the data suffers from heaping since most of the values are rounded to the 100-gram bin while a small fraction of the values are seemingly exact. The values that can be assumed to be ’non-rounded’ are shown as a small sub-distribution in the bottom of the distribution in Figure 3.
In Figure 4 the data has been divided into those values that are rounded to the 100-gram
bin, graph (a), and those who are exact, graph (b). Of course not all values that have a
value ending with two zeros are rounded, however, here it is approximated to be so. The
proportion of values that end with two zeros are 86.79 percent. Thus, the absolute majority
of the reported birthweights end with two zeros and the rounding interval will be assumed
to be 100 grams. That is, the value h in Equation 24, Section 2.5, is assumed to be 100.
050010001500200025003000
Birthweights 2009−2013
n=40 340 Birthweight (kg)
Count
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6
Figure 3: Birthweights in the final dataset, 2009-2013, n=40 340
Figure 4: Heaping in the birthweight variable for data 2009-2013. (a) Distribution of the birthweights that are rounded to the 100-gram bin. (b) Distribution of the birthweights that are not rounded to the 100-gram bin.
3.1 Variables
The final dataset consists of babies born between the years 2009 to 2013, without those fetuses that are classified as macerated stillbirths. In Table 1 the selected variables are presented with year-specific proportions of each factor. All observations with missing values on any of the variables presented in Table 1 are excluded.
Year 2009 2010 2011 2012 2013 Average
Early neonatal mortality* 6.05 % 5.99 % 5.50 % 4.04 % 3.98 % 5.20%
Anemia 1.74 % 2.99 % 2.39 % 2.18 % 2.26 % 2.31 %
Malaria 0.11 % 0.10 % 0.03 % 0.04 % 0.00 % 0.06%
Diabetes 0.21 % 0.05 % 0.13 % 0.08 % 0.19 % 0.13 %
HIV 7.37 % 4.99 % 4.24 % 3.83 % 2.93 % 4.79 %
Hypertension 5.54 % 5.78 % 4.98 % 5.63 % 5.85 % 5.53 % Private patient 26.14 % 21.49 % 14.34 % 20.91 % 17.40 % 20.12 % Born after 35 weeks** 4.78 % 5.71 % 4.75 % 4.34 % 5.03 % 4.92 %
Births (n) 8953 8163 9123 7421 6680 40340
Table 1: Percent of early neonatal mortality and potential risk factors for every year. (*Early neonatal mortality exclude macerate stillbirths, ** after 35 weeks of completed gestation)
In Table 1 the percentage of early neonatal mortality (with macerated stillbirths excluded)
and other potential risk factors are described. The two risk factors that show the lowest
presence in the data are malaria and diabetes. The low rates might not be the result of
actual low rates but rather lacking procedures to test mothers for malaria and diabetes
before giving birth. Thus, malaria and diabetes are probably underestimated and due to
this they are not included in the analysis. A previous study conducted at the labor ward at
Muhimbili Hospital has reported a rate of 6.4% of 1 721 mothers tested for malaria during
2002-2003 (Kidanto et al., 2009). A reported number of 6.4% is considerably higher compared
with the percentages reported for each year in Table 1. Testing the mother for HIV is made
on a regular basis since mothers with HIV should be given special medical treatment in order
to prevent transmission of HIV to the fetus during labor.
A rough proxy for social status is if the mother is a private patient or not. Moreover, gestational age (the age of the fetus in weeks) is dichotomized into ’preterm’ here defined as a baby born before week 36. The definition of preterm is babies born before week 37 but due to mis-classifications of gestational age there is quite probable that many of the births recorded as born in week 36 are actually born in week 37 (Kidanto et al., 2009), see the quality evaluation in Appendix 4. One reason for this is that a medical examination of fundal height has been used to assess gestational age and at week 36 the fundus has reached its maximal height. Thus, weeks after the 36th are hard to assess using this method and probably many of those babies that have been classified as born in week 36 might actually be older. This is indicated by looking at the gestational-age specific birthweights, see Appendix 4. Detailed information on the birthweight variable, gestational age as well as missing values for some related variables are presented in Appendix 4.
Ethical clearance for using the data in order to perform a quality evaluation of the database
as well as conducting this analysis of birthweight-specific neonatal health was given by the
Research and Publications committee at Muhimbili University of Health and Allied Sciences
(MUHAS).
4 Results
The results of this study are presented in this section. First, the impact of adjusting for heaping is described. Secondly, the selected model is presented and the risk factors preterm, anaemia, private patient and HIV are depicted for each component using the Parametric Mixture of Logistic Regressions (PMLR). The analysis has been made using the statistical software R and Professor Richard Charnigo at University of Kentucky, U.S, has provided me with the original codes for the functions and graphs.
4.1 Comparison of adjusted and non-adjusted estimates and model selection
The following two subsections describe how the model selection criterion Flexible Information Criterion (FLIC), the estimated mixture parameters, their 95% confidence intervals, and mortality risk estimates are influenced when adjusting for heaping. The adjustment method developed by Zhao and Bai (2012) is described in Section 2.5, the heaping interval is assumed to be 100-grams.
4.1.1 Adjusting for heaping: FLIC to select the number of components
There are many criteria that have been developed to give a measure of the model-fit. The
FLIC-criterion is described in Section 2.4.1 and is the main tool for selecting the order of
the mixture model in the framework by Charnigo and colleagues. In the following Table 2
the Log-Likelihoods and the FLIC values are presented for the non-adjusted and adjusted
EM-algorithm and for models containing 2 to 7 components. The choice of a maximum 7
components is arbitrary but it is hard to theoretically justify the use of more components
according to previous epidemiological knowledge (Charnigo et al., 2010a). The model with
the lowest FLIC should be selected as the best model according to the framework by Charnigo
and colleagues (Charnigo et al., 2010a).
Non-adjusted Adjusted
Components: Log-Likelihood FLIC Log-Likelihood FLIC
2 -37834.78 75720.68 -37863.13 75778.15
3* -37785.95 75653.68* -37785.94 75654.91*
4 -37781.03 75674.50 -37781.14 75676.45
5 -37777.77 75698.66 -37781.31 75707.93
6 -37775.18 75724.15 -37790.24 75756.92
7 -37734.35 75673.15 -37795.47 75798.52
Table 2: Log-likelihoods and FLIC for models with 2-7 components and when adjusting for heaping
In Table 2 the FLIC is generally larger for the adjusted model. Hence, the FLIC might be underestimated when not adjusting for heaping.
For the adjusted model the Log-Likelihood does not monotonically increase for an increase in the number of components. The models with 4 and 5 components show a larger value of the Log-Likelihood compared with models with 6 or 7 components, see Table 2. This is problematic and should be studied further.
Moreover, for both the adjusted and non-adjusted cases a three component model is selected since the lowest value of FLIC can be found here. Next section describes the mixture param- eter estimates in a three-component model when the EM-algorithm is adjusted for heaping.
4.1.2 Adjusting for heaping: Parameter estimates
The following Table 3 shows a comparison of the estimates and 95% confidence intervals for a three-component mixture model with and without adjustment for heaping. The confidence intervals are based on bootstrap standard errors with 250 replications. In Equation 22, Section 2.4.2, we let C
0= 4 (recommended by Charnigo and colleagues ( 2010a)).
In Table 3 the mixture estimates of a three-component Gaussian mixture model are presented.
Let ˆ p
idenote the mixture proportion for component i, ˆ µ
iand ˆ σ
ibe the mean and standard
deviation that specify each normal mixture for component i = 1, 2, 3. The estimates in Table
Non-adjusted Adjusted
Component: Estimate 95 percent CI Estimate 95 percent CI
1 p ˆ
10.08 (0.07; 0.09) 0.08 (0.07; 0.09)
ˆ
µ
11719.56 (1670.98; 1768.14) 1718.87 (1633.30; 1804.44) ˆ
σ
1491.20 (473.21; 509.20) 490.91 (458.77; 523.06)
2 p ˆ
20.90 ( 0.83; 0.97) 0.90 (0.81; 0.99)
ˆ
µ
23104.49 (3096.45; 3112.53) 3104.59 (3092.47; 3116.70) ˆ
σ
2499.80 (484.22; 515.38) 499.99 (475.52; 524.46)
3 p ˆ
30.02 (-0.04; 0.08) 0.02 (-0.06; 0.10)
ˆ
µ
33878.72 (3069.46; 4687.98) 3897.27 ( 3069.23; 4725.30) ˆ
σ
3665.41 (533.47; 797.35) 661.58 (594.36; 728.79)
Table 3: Mixture parameter estimates for a three-component model when adjusting for heaping and not. With approximate 95 % confidence intervals based on bootstrap standard errors
3 are slightly different when adjusting for heaping. The change in the point estimates are small while the estimated 95% confidence intervals are wider for the adjusted parameter estimates. The wider confidence intervals imply higher standard errors when adjusting the EM-algorithm for heaping which in turn imply higher variability in the point estimates. This aspect is discussed further in the discussion part, Section 5.
Next section investigates how the differences observed in the adjusted and non-adjusted estimates (see Table 3) might influence the mortality risk estimates.
4.1.3 Adjusting for heaping: mortality curves and risk estimates
In the following Figure 5 graph (a) and (b) represent the estimated component-specific mor-
tality rates when adjusting for heaping. The vertical axis shows the mortality risk on a
logarithmic scale. In Figure 5 the two graphs show almost identical birthweight-specific mor-
tality curves, the dashed-lines. The main difference is that the approximate 95% confidence
intervals (solid lines) seem to vary in width for different birthweights. One such example can
be seen at around 2300 grams where the upper bound for component 3 crosses the mortality
curve for component 1 in graph (a). In Table 4 the point estimates of mortality risks are presented at 1000, 2000 and 3000 grams for component 1, 2000, 3000 and 4000 grams for component 2 and 2000, 3000 and 4000 grams for component 3. The point estimates are ex- pressed as risk of death per 1000 births and the confidence intervals are approximately 95%
(Charnigo et al., 2010b). Looking at Table 4, 6 out of the 9 estimated confidence intervals are wider for the estimates when the EM-algorithm is adjusted for heaping. This, even if the confidence bounds vary in size there seems to be a tendency for the confidence bounds for the adjusted estimates to be slightly wider.
Figure 5: Birthweight-specific risk of mortality (a) non-adjusted for heaping (b) adjusted for heaping. Solid lines are approxi- mate 95 % confidence intervals. Resamples=250 and n=40 340.
Regarding the point estimates of mortality risk per 1000 births in Table 4 adjusting for heaping has practically no impact. However, the confidence intervals are slightly wider for some of the adjusted estimates. It should be kept in mind that the standard errors to the confidence intervals are drawn randomly and thus, there might be some variation due to randomness. These aspects can be studied further by using some other method than resampling to assess the standard deviation or by conducting a simulation study.
The issues with using bootstrap and resampling to assess standard errors will be further
discussed in the discussion part, Section 5. The next section will give a description of the final
three-component model when the EM-algorithm is adjusted for heaping using the method
Estimated mortality per 1000 births
Non-adjusted Adjusted
Component: Weight: Risk* 95% CI Risk* 95% CI
1 1000 g 484.40 (373.20; 597.30) 484.40 (379.80; 590.40) 2000 g 54.80 (17.30; 159.90) 54.80 (14.20; 189.40) 3000 g 3.30 (10.80; 34.30) 2.90 (10.80; 38.70) 2 2000 g 204.90 (123.60; 320.30) 204.90 (110.20; 349.10)
3000 g 31.00 (25.40; 37.60) 31.00 (25.60; 37.40) 4000 g 25.00 (11.20; 54.80) 25.00 (10.90; 56.20) 3 2000 g 8.30 (0.30; 170.30) 8.30 (0.40; 145.50) 3000 g 12.30 (2.70; 54.20) 12.30 (2.60; 55.20) 4000 g 17.10 (0.00; 899.50 ) 17.10 (0.00; 902.20)
Table 4: Birthweight-specific non-adjusted and adjusted estimates of mortality risk. Approximate 95 % confidence intervals.
Resamples=250 and n=40 340. (*Risk=logit−1{ˆθ})
explained in Section 2.5 and with the defined heaping interval 100-grams.
4.2 Description of the final model using Mixture models and Para- metric Mixture of Logistic Regressions, PMLR
The following Figure 6, graph (a), shows the estimated three-component mixture model when heaping is adjusted for. The corresponding adjusted mixture estimates can be seen in Table 3. Let ˆ p
idenote the mixture proportion for component i, ˆ µ
iand ˆ σ
ibe the mean and standard deviation that specify each normal mixture for component i = 1, 2, 3.
In Figure 6 (a), the first component seems to represent the births in the fatter left tail well.
This component is estimated to contain 8 percent of the births while the main component
2 is estimated to contain 90 percent of all births. The third component is the smallest one
and is estimated to contain 2 percent of all births. However, the approximate 95 percentage
confidence bounds show that the proportion 0 lie within the interval and thus, it could be
the case that in the population there would be no third component. The component that has the largest estimated standard deviation is the third component , 661.58 grams 95% CI (594.36; 728.79). Both the first and second components have relatively similar estimated standard deviations on 491 grams 95% CI (458.77; 523.06) and 500 grams 95% CI (475.52;
524.46) respectively.
Estimates to a three-component Gaussian mixture model Component: Estimate 95 percent CI
1 p ˆ
10.08 (0.07; 0.09)
ˆ
µ
11718.87 (1633.30; 1804.44) ˆ
σ
1490.91 (458.77; 523.06)
2 p ˆ
20.90 (0.81; 0.99)
ˆ
µ
23104.59 (3092.47; 3116.70) ˆ
σ
2499.99 (475.52; 524.46)
3 p ˆ
30.02 (-0.06; 0.10)
ˆ
µ
33897.27 ( 3069.23; 4725.30) ˆ
σ
3661.58 (594.36; 728.79)
Table 5: Mixture parameter estimates for a three-component Gaussian mixture model when adjusting for heaping and not.
With approximate 95 % confidence intervals based on bootstrap standard errors. Using the adjusted EM-algorithm
The second graph (b) in Figure 6 depicts the estimated birth-weight specific mortality rates for the three components. A descriptive comparison between the components shows that the mortality curve for component 1 has a steeper slope compared with the mortality curve of the births estimated to belong to component 2. The difference in the slopes and that they appear to be somewhat parallel to each other indicates that the births in component 1 have a relatively lower of mortality for the same birthweights as the births that are estimated to belong to component 2.
In Figure 6 (b) the mortality rates are decreasing for higher birthweights in component 3.
One possible explanation for the decrease in mortality risk for high birthweight babies in
component 3 is that few babies have such a weight in the dataset for 2009-2013. Charnigo
and colleagues (2010b) have got similar results and explain it as an artificial decrease due to
Figure 6: A three-component Gaussian mixture model fitted to birthweights 2009-2013, n= 40 340. (a) Components, real data and mixture, (b) Estimated component-specific mortality curves and overall mortality. Using the adjusted EM-algorithm