Birthweight-specific neonatal health: With application on data from a tertiaryhospital in Tanzania

(1)

Birthweight-specific neonatal health With application on data from a tertiary

hospital in Tanzania

by Elisabeth Dahlqwist

Independent Thesis Advanced Level Department of Statistics

Uppsala University

Supervisor: Inger Persson

2014

(2)

Abstract

The following study analyzes birthweight-specific neonatal health using a combina- tion of a mixture model and logistic regression: the extended Parametric Mixture of Logistic Regression. The data are collected from the Obstetric database at Muhimbili National Hospital in Dar es Salaam, Tanzania and the years 2009 -2013 are used in the analysis. Due to rounding in the birthweight data a novel method to adjust for rounding when estimating a mixture model is applied. The influence of rounding on the estimates is then investigated.

A three-component model is selected. The variables used in the analysis of neonatal health are early neonatal mortality, if the mother has HIV, anaemia, is a private patient and if the neonate is born after 36 completed weeks of gestation. It can be concluded that the mortality rates are high especially for low birthweights (2000 or less) in the estimated first and second components. However, due to wide confidence bounds it is hard to draw conclusions from the data.

Keywords: Birthweight, heaping, neonatal mortality, Gaussian mixture model,

logistic regression

(3)

1 Introduction 1

1.1 Birthweight and neonatal health . . . . 1 1.1.1 Parametric methods to analyze birthweight-specific neonatal health . 2 1.1.2 The concepts of rounding, heaping and digit preference . . . . 3 1.2 Aim and research questions . . . . 5

2 Analyzing birthweight-specific neonatal health 6

2.1 The finite Gaussian mixture model . . . . 6 2.2 A semi-parametric method . . . . 7 2.3 A parametric method to analyze birthweight-specific neonatal mortality . . . 8 2.4 An extension of the Parametric Mixture of Logistic Regressions (PMLR) . . 11 2.4.1 Order of the Gaussian mixture model . . . . 12 2.4.2 Confidence intervals . . . . 14 2.5 The Gaussian mixture model for rounded data . . . . 18

3 Data 20

3.1 Variables . . . . 23

4 Results 25

4.1 Comparison of adjusted and non-adjusted estimates and model selection . . 25 4.1.1 Adjusting for heaping: FLIC to select the number of components . . 25 4.1.2 Adjusting for heaping: Parameter estimates . . . . 26 4.1.3 Adjusting for heaping: mortality curves and risk estimates . . . . 27 4.2 Description of the final model using Mixture models and Parametric Mixture

of Logistic Regressions, PMLR . . . . 29 4.2.1 Summary: Early neonatal mortality rates . . . . 32 4.2.2 Descriptive analysis of birthweight-specific patterns in each component 34 4.2.3 Summary: Component specific characteristics . . . . 38

5 Discussion 40

(4)

5.1 Impact of heaping . . . . 40 5.2 Epidemiological findings . . . . 41

6 Suggestions for further studies 43

7 Acknowledgements 45

8 References 46

9 Appendix 51

9.1 Appendix 1: A literature review on analysis of rounding and heaping . . . . 51 9.2 Appendix 2: Expectation-Maximization algorithm . . . . 52 9.3 Appendix 3: Additional results . . . . 56 9.4 Appendix 4: Completeness and accuracy of variables related to the analysis

of birthweight and neonatal mortality in the Obstetric database at Muhimbili

National Hospital 1999-2013. . . . 58

(5)

1 Introduction

This study was partly conducted at Muhimbili National Hospital, a tertiary hospital in Dar es Salaam, Tanzania. My fundamental idea with the study was to understand the conditions under which data to the Obstetric database at Muhimbili National Hospital (MNH) are collected. Information from my observations as well as descriptive statistical analysis of data from the database are presented in a comprehensive quality evaluation (Appendix 4). This evaluation focuses on the variables that are included in this study. That is, variables related to the analysis of birthweight and neonatal health.

1.1 Birthweight and neonatal health

Maternal and child health are important issues concerning all societies. In 2000 the reduction of child mortality and improvement of maternal health in developing countries was chosen as two of the eight Millennium Goals set up by the United Nations (UN, 2014). It is evident that in order to better understand and follow the development in the care of children and mothers the existence of high quality data as well as suitable statistical methods to analyze data is of significant importance.

Birthweight is an important variable in this context. The World Health Organization (WHO)

has classified a birthweight lower than 2500 grams as Low birthweight (LBW) and used this

classifier as an indicator of poor neonatal health ( WHO, 2014; Bian et al., 2012). However,

birthweight is a continuous variable and dichotomizing it by using a cut-off at 2500 gram

might give misleading results. An example of how the LBW classifier can be misleading

is that two neonates with the same birthweight can have different mortality risks due to

confounding factors that are influencing neonatal mortality. Moreover, it has been argued

that birthweight is not even on the causal pathway to neonatal mortality (Basso et al.,

2006; Wilcox, 2001). These two problems with using LBW as an indicator of neonatal

health has created a debate about the usefulness of the LBW classifier as well as a search

for alternative methods. The methods developed by Wilcox et al. and Gage et al. use

(6)

the birthweight distribution with its fat left tail. Previous research has shown that one Gaussian distribution and two residual distributions with unknown parametric distribution (Umbach and Wilcox, 1996) or two Gaussian distributions (Gage, 2009) have given a good fit. These sub-distributions that are fitted in a mixture model are usually denoted as ’sub- populations’ or more generally, components. The word ’components’ will be used in this study.

Moreover, the interesting question is if these components carry some epidemiological meaning.

It has been shown that the mortality curves for the components in the mixture distributions show different shapes indicating heterogeneity in the birth-cohort (Gage, 2009). One in- terpretation is that the component showing the overall highest risk for neonatal mortality could be regarded as the population at risk. If this is the case, mixture models based on birthweight could be used to control for unobserved heterogeneity in the birth-cohort and thus enable further statistical analysis assuming homogeneity in each component. Moreover, a framework for analyzing other characteristics than mortality of the components has been developed (Deng et al., 2011). The following sections will give some introduction to how birthweight-specific neonatal mortality can be modeled as well as an introduction to the concept of heaping and rounding.

1.1.1 Parametric methods to analyze birthweight-specific neonatal health

The finite Gaussian mixture model has initially been used as a tool to identify normal and

compromised subpopulations using birthweight (Umbach and Wilcox, 1996) and extended

to discover different patterns of neonatal mortality or risk (Charnigo et al., 2010a; Deng et

al., 2011; Gage, 2009; Gage and Therriault, 1998). The two main approaches can be divided

thusly; first, the semi-parametric approach formulated by Wilcox and Umbach and second,

the parametric approach called ’Parametric Mixture of Logistic regressions’ (PMLR) formu-

lated by Gage and colleagues (Gage, 2009; Gage and Therriault, 1998). In the parametric

PMLR approach the mixtures are used to control for unobserved heterogeneity when using

logistic regression with probabilistically assigned component memberships within each com-

ponent (Gage, 2009). The PMLR has later been extended by Deng, Charnigo and colleagues

(7)

and their extension will be explained and applied in this study (Charnigo et al., 2010a;

Charnigo et al., 2010b; Deng et al., 2011).

One potential problem when using the parametric approach is that it theoretically demands exact data on birthweight. As birthweight is measured in grams it is assumed to be continuous and since the gram is an exact unit of measurement, birthweights can be regarded as measured on a ratio scale. In the ideal setting the digital weighing machine generates birthweight data that are immediately reported to a computerized database. This procedure would generate perfectly recorded birthweights which fulfills the theoretical assumptions made about the data. If this is the case the use of parametric methods can be justified. Unfortunately, previous studies have discovered rounding or heaping in the birthweight variable in different contexts (Dellaportas et al., 1996; Edouard and Senthilselvan, 1997; Emmerson and Roberts, 2013), see Section 1.1.2. Rounding and heaping violates the assumption of continuous data.

Moreover, the phenomenon of rounding or heaping might be more present in contexts lacking proper weighing instruments as well as awareness of how rounding can constitute a statistical problem. In a comprehensive quality evaluation made by the author of the Obstetric database at Muhimbili National Hospital (MNH) in Tanzania an extensive problem with heaping in the birthweight variable is described (Appendix 4).

Leaving some of the epidemiologically interesting aspects behind, the aim with this study is to apply the extended Parametric Mixture of Logistic Regression approach on obstetric data from Muhimbili National Hospital and adjust for heaping. Moreover, even if this study does not focus on the epidemiological aspects some notes on the medical findings will be given.

1.1.2 The concepts of rounding, heaping and digit preference

The concept of rounding, heaping or digit preference has for a long time been recognized in the statistical and medical literature (see e.g. Bar and Lillard, 2012; Edouard and Senthilselvan, 1997; Emmerson and Roberts, 2013; Hasselstr¨om et al., 2013; Heitjan and Rubin, 1991;

Heitjan, 1989). Moreover, there are a few related concepts that to some extent imply the

same basic phenomenon. These are coarse data, rounding,heaping and digit preference.

(8)

’Rounding’ can be considered as an elementary form of coarse data where values are rounded approximately to the same extent (Heitjan and Rubin, 1991). ’Heaping’ denotes the phe- nomenon when values are rounded with different degrees of exactness (Wang, H.,2009) while

’Digit preference’ are non-uniform rounding for all values. Instead the rounding depends on how close the value is to a ’preferred’ value (Camarda, C.G. et al., 2008).

A simulated example of heaping is shown in Figure 1. The data is simulated from a mix- ture of two normal distributions and represent a possible birthweight distribution with data measured in kilos. All values are rounded to one decimal, representing rounding, and the chimneys represent preferred values. This mix of two different types of rounding represent heaping.

Figure 1: A simulated example of rounding and heaping in the birthweight variable using a two component Gaussian mixture model, n= 100 000

The implications of heaping are discussed in a range of areas, see Appendix 1, and this study will briefly discuss its implications for the analysis of birthweight-specific neonatal health.

Hence, this study will merge two fields that are somewhat overlapping, one that focuses

on the analysis of birthweight in relation to neonatal health and one that aims to correct

(9)

for heaping in the birthweight variable. With this background, the aim and the research questions are now given.

1.2 Aim and research questions

The main aim with this study is to apply PMLR framework for analyzing birthweight-specific neonatal health extended by Charnigo and colleagues to the data from the Obstetric database at MNH. From the quality evaluation it is evident that there is a problem with heaping.

Thus, the framework will also be adjusted for heaping by the method developed by Zhao and Bai (2012), see Section 2.5, in order to evaluate the influence of heaping on the parameter estimates.

Hence, the first part is to investigate birthweight-specific neonatal mortality and prevalence of risk factor such as HIV, private patient and preterm in the different components. The second part is to use the adjusted estimates to evaluate the influence of heaping. These two parts are connected as the second part gives some implications on the accuracy of the estimates in the first part. The research questions of this study are:

• Firstly, how can the components in a finite Gaussian mixture model be characterized by risk factors using data from the Obstetric database at Muhimbili Nationa Hospital, Dar es Salaam, Tanzania?

• Secondly, what are the implications on the parameter estimates in a mixture model for analyzing birthweight when not adjusting for heaping in the birthweight variable?

Moreover, the general goal is to give an empirical example of pitfalls and possibilities when

using data in a low-income setting.

(10)

2 Analyzing birthweight-specific neonatal health

This section presents some mathematical background and practical information needed in order to understand how to fit finite Gaussian mixture models and how to use the Parametric Mixture of Logistic Regressions to analyze birthweight-specific neonatal health.

2.1 The finite Gaussian mixture model

The finite-dimensional mixture model was originally developed by Karl Pearson as a method to model asymmetrical or skewed data (McLachlan and Peel, 2000). Pearson applied a two- component Gaussian mixture model as a tool to use the forehead to body length of crabs to distinguish between two subspecies within the overall population of crabs. One attractive feature of the mixture model is that it is semi-parametric as not one single parametric model is assumed for the whole population. Instead, the population is modeled as a mix of sub- populations of some specified distributions (McLachlan and Peel, 2000). Component mem- bership (such as species) is not always known and in these cases the iterative Expectation- Maximization (EM) algorithm is used to probabilistically assign component membership to each observation and achieve maximum likelihood (ML) estimates of the mixture parameters (McLachlan and Peel, 2000; McLachlan and Thriyambakam, 2000).

In Equation 1 the definition of the Gaussian mixture model with g components is presented.

f (y

j

) =

!

g i=1

p

i

× N(y

^j

| µ

ⁱ

, σ

i

) where N (y

j

| µ

ⁱ

, σ

i

) = f (y

i

| µ

ⁱ

, σ

i

) = 1

"

2πσ

²_i

e

⁻

(yj−µi)2 2σ2i

(1) where g denotes the number of components such that i = 1, . . . , g and n be the number of observations, j = 1, . . . , n. Moreover, let p

i

denote the mixing proportion which is always non-negative and smaller than unity so that the sum of all mixing proportions sum to unity.

Thus, the conditions in Equations 2 and 3 should be fulfilled (McLachlan , 2000).

(11)

0 ≤ p

ⁱ

≤ 1 (i = 1, . . . , g) (2)

and

!

g i=1

p

i

= 1 (3)

The definition of the mixture model, Gaussian mixture model and the general approach of estimation using the Expectation-Maximization algorithm is presented in Appendix 2 and is well described in (McLachlan and Peel, 2000; McLachlan and Thriyambakam, 2000).

The following subsections will further explain how parametric and semi-parametric versions of the Gaussian mixture model can be used when analyzing birthweight-specific neonatal health.

2.2 A semi-parametric method

A semi-parametric three component finite mixture model for measuring epidemiologically

useful characteristics of the birthweight distribution is presented by Wilcox and Umbach

(Umbach and Wilcox, 1996). This model assumes that the birthweight distribution is accu-

rately described by a main component following a normal distribution and consisting of all

regular births and two residual components following no known distribution and consisting of

high-risk births in the residual left and right tails of the main distribution. This approach is

regarded as semi-parametric since no distributional assumption is made on the two residual

distributions. The method by Wilcox and Umbach uses the birthweights compressed into

bins of 100-grams and is therefore more robust to rounding and heaping.

(12)

2.3 A parametric method to analyze birthweight-specific neonatal mortality

There are many factors that influence infant mortality, some which are observable and usually reported to Obstetric databases, such as the diagnosis of the mother, mother’s educational background and social status and complications during pregnancy. Other factors that might influence neonatal mortality such as the overall health of the mother and child, the lifestyle of the mother and nutritional intake are harder to observe and measure correctly. More- over, there might be additional factors that are latent and unobserved. In order to analyze birthweight-specific neonatal mortality by logistic regression one of the main assumptions is that the model is correctly specified and control for all factors that influence infant mortal- ity, that is , there is no neglected heterogeneity (Wooldridge, 2010, p.582). By including all factors we can regard the birth-cohort as homogeneous since the factors that induce hetero- geneity are controlled for. However, as described, there are many factors that need to be controlled for, observable as well as latent. The Parametric Mixture of Logistic Regressions has been developed by Gage (2009) with the purpose to use logistic regression within each component to control for unobserved heterogeneity (Gage, 2009).

Figure 2: Simulated birthweights, n= 5 000 (using R-code from Charnigo et.al. 2010a) (a)A two-component Gaussian mixture model, (b) A four-component Gaussian mixture model fitted

(13)

The Parametric Mixture of Logistic Regressions approach assumes that the birthweight dis- tribution can be correctly modeled by a two component normal mixture model. The two components are assumed to represent the babies undergoing normal and compromised fetal development. The first graph in Figure 2 shows a two-component Gaussian mixture model that is fitted to simulated data from a four-component mixture model of 5000 observations.

The second graph in Figure 2 depicts the simulated data modeled by a four component model.

One of the main interests when using Gaussian mixture models is to evaluate birthweight- specific neonatal mortality rates within each component. This analysis is possible to perform by a reformulated version of the logistic regression adopted for the setting when component membership is only probabilistically known (Gage, 2009). Since the component memberships are only probabilistically known it is assumed that the mortality rate of the population at the specific birthweight y

j

is a weighted sum of the mortality at the two components at birthweight y

j

. The standardized weights are defined in Equation 4 according to Gage (2009):

ˆ q(y

_j

) =

#

₁

ˆ p1

$ f (y

j

| ˆµ

¹

, ˆ σ

1

)

#

₁

ˆ p1

$ f (y

j

| ˆµ

1

, ˆ σ

1

) + #

₁

ˆ p2

$ f (y

j

| ˆµ

2

, ˆ σ

2

) (4)

The procedure contains two steps and thus, the weights in Equation 4 are defined as a function of birthweights conditioned on the Maximum-Likelihood parameter estimates of the mixture model, ˆ p

i

, ˆ µ

i

and ˆ σ

i

, i = 1, 2, where ˆ p

i

is the estimated proportion of the sample that belong to component i. As the first step is to estimate the parameters in the mixture model, the weights and the model parameters are regarded as posterior probabilities of component membership when later estimating the logistic regressions in every component.

The birthweight-specific mortality rates, denoted as h(y

j

), are estimated according to the following Equation 5 (Gage, 2009):

h(y

j

) =

% e

⁽^!²^l=0^ˆb^1l^(y^j⁾⁾

1 + e

⁽^!²^l=0^ˆb^1l^(y^j⁾⁾

q(y ˆ

j

) + e

⁽^!²^l=0^ˆb^2l^(y^j⁾⁾

1 + e

⁽^!²^l=0^ˆb^2l^(y^j⁾⁾

[1 − ˆq(y

^j

)]

&

(5)

The probabilities,

^e⁽

!2l=0 bil(yj ))

1+e⁽^!2^l=0^{bil(yj ))}

represent the component-specific probabilities of dying in the

two components. The logistic regression can be expressed as Equation 6 (Gage, 2009):

(14)

!

2 l=0

ˆb

1,l

(y

j

) = ˆb

1,0

+ ˆb

1,1

y

j

+ ˆb

1,2

y

_j²

(6)

where ˆb

1l

(y

j

) and ˆb

2l

(y

j

) are the component specific estimates for the covariates. Gage ar- gue that a second degree polynomial is a good choice as it captures the U-shape of the birthweight-specific mortality for each component which has been shown in previous studies on homogeneous birth-cohorts (Gage, 2009). The estimates of the component specific param- eters in the logistic regressions are fitted by minimizing the sum of the weighted probabilities in Equation 7 using the paired observations on mortality, x

j

, and birthweights, y

j

, across all births (Gage, 2009).

−ln

% e

^(x^j^!²^l=0^ˆb^1l^(y^j⁾⁾

1 + e

⁽^!²^l=0^ˆb^1l^(y^j⁾⁾

q(y ˆ

j

) + e

^(x^j^!²^l=0^ˆb^2l^(y^j⁾⁾

1 + e

⁽^!²^l=0^ˆb^2l^(y^j⁾⁾

[1 − ˆq(y

^j

)]

&

(7)

One very important aspect is that the Gaussian mixture model is heavily based on the assumption that the sub-populations are normally distributed. The Gaussian mixture model that is specified by Gage also assumes that the number of components are two. In reality, the actual number of components in the population of births is unknown. As it can be argued that the number of components can vary between populations, having a fixed number of components is problematic since it might ignore differences in the number of sub-populations between populations.

Charnigo and colleagues have extended the PMLR approach developed by Gage to a flexible number of components (Charnigo et al., 2010a). The generalized extension of the PMLR tech- nique allows for estimating the number of components using a Flexible Information Criterion (FLIC) developed by Charnigo and colleagues (Charnigo et al., 2010a). Another extension is to use the PMLR to understand how risk factors such as mothers’ diagnosis and gestational age (age of the fetus in weeks) can characterize the components in the mixture model (Deng et al., 2011).

Even though the problem with assuming that each component follows a normal distribution

is not discussed by Charnigo and colleagues their extension of the PMLR developed by

Gage is a novel and interesting approach to analyze neonatal health. Due to the problems

(15)

with unobserved heterogeneity it is interesting to apply the extended version of the model developed by Gage in order to analyze birthweight-specific neonatal health using the data from Muhimbili National University. The details of this framework will be explained next.

2.4 An extension of the Parametric Mixture of Logistic Regres- sions (PMLR)

The Parametric Mixture of Logistic regressions technique formulated by Gage and colleagues has been extended from containing 2 to g components in a finite Gaussian mixture model to by Charnigo and colleagues. The risk function or the birthweight-specific mortality for component i (1 ≤ i ≤ g) can be expressed as in Equation 8 (Charnigo et al., 2010b):

ˆ

r

i

(y

j

) = e

⁽^!⁴^l=0^ˆb^i,l^(y^j⁾⁾

1 + e

⁽^!²^l=0^ˆb^i,l^(y^j⁾⁾

(8) where y

j

represents the birthweight for individual j and '

4

l=0

ˆb

i,l

(y

j

) represents the logistic regression function for every birthweight (Charnigo et al., 2010b). In the two component case (Gage, 2009) it was assumed that a second-degree polynomial could describe the mor- tality curves for each component. However, Charnigo et al. argue that letting ˆb

i,l

(y

j

) to be a four-degree polynomial allows for up to two changes in convexity for each birthweight- specific mortality curve (Charnigo et al., 2010b). Thus, the logistic regression function to be estimated is formulated as in Equation 9 (Charnigo et al., 2010b).

!

4 l=0

ˆb

i,l

(y

_j

) = ˆb

_i,0

+ ˆb

_i,1

y

_j

+ ˆb

_i,2

y

²_j

+ ˆb

_i,3

y

_j³

+ ˆb

_i,4

y

_j⁴

(9)

where i = 1, . . . , g and j = 1, . . . , n. For all g components the population overall risk-function can be expressed as in Equation 10 by using the law of total probability:

'

g

i=1

r ˆ

i

(y

j

)ˆ p

i

f (y

j

; ˆ µ

i

, ˆ σ

i

) '

g

i=1

p ˆ

_i

f (y

_j

; ˆ µ

_i

, ˆ σ

_i

) (10)

(16)

The parameters ˆ r

i

(y

j

) in Equation 10 are estimated using the optimization (optim in R, R foundation for statistical computing, Vienna, Austria, 2009) by conditioning on the first step where the parameters in the normal-components have been estimated by the EM-algorithm and optimization. As was noted before, one of the fundamental differences in the approach developed by Charnigo et al. and Gage et al. is that the former allows for selecting the number of components that show the best fit to the data. The assessment of ’best fit’ is to use a penalized likelihood function as an information criterion. Charnigo and Pilla (Pilla and Charnigo, unpublished) have developed the Flexible Information Criterion (FLIC) that is used to select the number of components. This is presented shortly in the next section.

2.4.1 Order of the Gaussian mixture model

The procedures to select the order of a finite mixture model are many and can roughly be divided into formal tests and a criterion based on a penalized likelihood (McLachlan p.220, Li 2012). In the new extended framework that generalizes the PMLR procedure the Flexible Information Criterion (FLIC) (Equation 11) is used as the criterion for selecting the number of components (Charnigo et al., 2010a). This criterion is developed by Pilla and Charnigo (Pilla and Charnigo, unpublished) and is formulated in Equation 11:

F LIC

g

= −2logL

g

+ 2(log √

n)

^B(n,δ)

(3g − 1) (11)

where g denotes the number of components, n the sample size and the function B(n, δ) is explained in Equations 14 and 15. Two other commonly used information criteria are the Akaike Information Criterion (AIC), Equation 12, and Bayesian Information Criterion (BIC) ,Equation 13.

AIC

g

= −2logL

g

+ 2(3g − 1) (12)

BIC

_g

= −2logL

g

+ logn(3g − 1) (13)

(17)

The difference between the criterias in Equations 12, 13 and 11 is the penalty term B(n, δ) in Equation 11. This term is not only determined by the sample size but also by the con- figuration of the data points so that more heterogeneity in the sample allows for selecting a model with more components. This is approximated by the penalty statistic that uses the fraction of between-component variability to total variability for the estimated g-component model as described in Equation 14 (Pilla and Charnigo, unpublished)

δ(Y ) = 1 − 1 g

!

G g=2

( 1 S

²

!

g i=1

ˆ

p

i,g

)ˆ θ

i,g

− ¯ Y *

2

+

(14)

where g denotes the number of components and G the model with the maximum number of components. Let ˆ θ

i,g

denote the mean response function. Moreover, ¯ Y and S

²

are the sample mean and variance and ˆ p

_i,g

is the estimated mixing proportions for each model. The term B(n, δ) in Equation 11 is denoted as the ’bivariate ratio function’ and is defined in Equation 15:

B(n, δ) = Φ[(log √

n)

^δ

] − Φ(1)

1 − Φ(1) (15)

where Φ( ·) denotes the standard Gaussian cumulative distribution function. The bivariate ratio function in Equation 15 is defined for integers n > exp(2) and real δ ∈ [G

⁻¹

, 1].

Moreover, it should be noted that B(n, δ) is nonnegative and increasing in both n and δ (Pilla and Charnigo, unpublished).

Charnigo and colleagues have chosen FLIC as the main model selection criterion in their PMLR framework (Charnigo et al., 2010a). The reason for this is that a simulation study comparing the three criterias; Akaike Information Criterion (AIC) (12), Bayesian Information Criterion (BIC) (13) and FLIC has shown that while BIC and FLIC performed better for large samples FLIC and AIC performed better for small samples (Charnigo et al., 2010a).

Thus, in comparison with AIC and BIC FLIC performs better.

It should be kept in mind that FLIC depends on the sample size in combination with the

number of true components (Charnigo et al., 2010a). A possible solution to this problem is

(18)

to resample from the original sample with a fixed sample size of 50 000 observations so that the result from the FLIC can be comparable with other populations (Charnigo et al., 2010a).

This is employed in the further generalized approach by Deng and colleagues (Deng, et al., 2011).

We continue with the next section which will give a brief explanation of how to access the uncertainty in the mixture parameter estimates as well as how to construct confidence intervals and compare components.

2.4.2 Confidence intervals

The main drawbacks of the EM algorithm is that it does not produce an analytical variance and covariance matrix of the estimates such as the Fisher’s scoring algorithm and similar methods do (McLachlan, 2000). In the framework developed by Charnigo and colleagues the problem of no analytical covariance matrices is solved by using resampling. As Charnigo and colleagues have worked with large sample cases (more than 200 000 observations) the computation of the EM-algorithm is slow when resampling from the whole dataset. Instead, resampling with replacement is made on n=50 000 observations. These resamples are then used to select the the number of components using FLIC and assess uncertainty in the pa- rameter estimates (Deng et al., 2011). Since 50 000 observations are drawn with replacement Charnigo and colleagues have formulated approximate 95 % confidence intervals that account for overlapping samples (Charnigo et al., 2010b).

The confidence intervals are estimated in the following way in the large sample case. The

uncertainty in parameter estimates is estimated by resampling a sample of n=50 000 outcome

pairs of birthweight and mortality (or any other risk factor) N

_rep

times with replacement. The

mixture parameters as well as the parameters in the logistic regression are then estimated

using these meta samples, samples of a sample (Charnigo et al., 2010b). Let θ represent

any parameter to be estimated for making inference. For each one of the N

rep

samples the

parameter is estimated giving the meta sample, ˆ θ

₁

, . . . , ˆ θ

_N_rep

, of which a meta sample mean

is estimated ˆ θ

meta

(Equation 16), (Charnigo et al., 2010a).

(19)

θ ˆ

meta

=

'

Nrep

l=1

θ ˆ

l

N

rep

(16)

where N

_rep

represents the number of resamplings. The corresponding meta sample standard deviation ˆ S

θ

is constructed according to Equation 17:

S ˆ

θ

=

,'

Nrep

l=1

(ˆ θ

l

− ˆθ

^meta

)

²

(N

_rep

− 1) (17)

The 95 percent confidence interval can then be constructed as in Equation 18.

θ ˆ

meta

± C

⁰

× S ˆ

θ

"

N

rep

(18)

In Equation 18, C

0

can be chosen as the 0.025-quantile of the standard normal distribution if the meta sample, ˆ θ

1

, . . . , ˆ θ

Nrep

, is approximately normally distributed by the Central Limit Theorem. If the sample is small but the population is assumed to be normally distributed the 0.025-quantile of the t-distribution with (N

_rep

− 1) degrees of freedom can be used. The more conservative alternative is to choose C

0

=

^√_0.05¹

= 4.47 based on Chebychev’s inequality (Charnigo et al., 2010a). The later alternative is the most suitable one since inference for mixture parameters needs a very large sample size is needed in order for asymptotic theory to apply (Xiang et al.,2005; McLachlan , 2000).

Moreover, let C

φ

denote the critical value that adjusts for overlaps in the resampling (Charnigo et al., 2010a, Additional file 1: Technical appendix). Let φ =

_Nⁿ

where N is the original sam- ple size and n is the sample size within each resample N

rep

. Then C

φ

can be defined thusly (Equation 19) :

C

φ

= C

0

, φN

rep

(1 − (1 − φ)

^N^rep

) (19)

However, if the number of observations within each resample is large and close to the total sample size, (n ≈ N) then C

φ

≈ C

0

"

N

rep

.

(20)

When the total sample size is relatively small (around 50 000 observations) there is no need to draw resamples of n < N since the computing time will not differ much from taking n = N . Thus, for the small sample case the bootstrap method can be applied. The bootstrap approach is more or less equivalent to the resampling method described by Charnigo and colleagues(Charnigo et al.,2010a) but using n = N (Chernick, 1999). Using the bootstrap implies that N observations are sampled with replacement N

rep

times and yield the following confidence interval when inserted in Equation 18 (Charnigo et al., 2010a):

θ ˆ ± C

0

× ˆ S

θ

(20)

Another issue is that the mixture model parameters have non-negligible bias. Charnigo and colleagues have adjusted the confidence interval by adding the estimated bias ˆ B

θ

to the interval in Equation 18. This bias term is estimated from simulations in which the esti- mated parameters from the mixture model, ˆ p

1

, ˆ µ

1

, ˆ σ

1

, . . . , ˆ p

g

, ˆ µ

g

, ˆ σ

g

are used in 5 simulations.

The simulated data are then used to estimate the parameters in the previously selected g-component Gaussian mixture. The 5 simulated estimates are then compared with the es- timates from the mixture estimates from the whole dataset, ˆ θ. The following Equation 21 present the estimate of the mean bias of 5 simulations (Charnigo et al., 2010a).

B ˆ

θ

= '

5

k=1

| ˆθ

sim

− ˆθ |

5 (21)

Where k = 1, ..., 5 and represent the number of simulations that are used to estimate the mean bias term B

θ

. The confidence interval that takes the bias into account when n = N can then be formulated as in Equation 22 by using Equation 20 (Charnigo et al., 2010a).

θ ˆ ± ( ˆ B

θ

+ C

0

× ˆ S

θ

) (22)

When making confidence bounds for the risk functions the procedure is analogous. For every

resample with the birthweight and risk factor outcome pair an overall estimate of the risk

(21)

function for component i is estimated for each of the N

rep

resamples according to Equation 8, ˆ r

i,1

(y

j

), . . . , ˆ r

i,Nrep

(y

j

) as in Equation 23 (Charnigo et al., 2010a).

ˆ

r

i

(y

j

) = logit

⁻¹

- N

_rep⁻¹

N

!

rep

s=1

logit {ˆr

^i,s

(y

j

) } .

= exp /

N

_rep⁻¹

'

Nrep

s=1

log (

ˆ ri,s(yj) 1−ˆri,s(yj)

+0

1 + exp /

N

_rep⁻¹

'

Nrep

s=1

log (

_ˆ_r

i,s(yj) 1−ˆri,s(yj)

+0 (23)

Where i = 1, . . . , g and g denotes the number of components. By fixing the birthweight to y = y

0

we can define the parameter as θ = logit {r

i

(y

0

) }. For the N

rep

resample the risk estimates are defined as logit {ˆr

i,1

}, . . . , logit{ˆr

i,Nrep

} and denoted as the meta sample, θ ˆ

1

, . . . , ˆ θ

Nrep

for a fixed birthweight, y = y

0

. From this meta sample a meta average, ˆ θ, and standard deviation, ˆ S

_θ

, are estimated. The confidence intervals are then constructed as in Equation 22. The bias is estimated analogously and according to Equation 21.

The number of replications, N

rep

, is still a crucial question. Charnigo and colleagues use large datasets of 200 000 births or more and in that setting they recommend to draw N

rep

= 25 replications by resampling with n=50 000. However, in their study a sample size of 200 000 births are denoted as small and the confidence intervals are wide for the extreme birthweights.

The dataset that is used in this study contains 40 340 observations and are only a fraction of 200 000. Thus, in this context this is a small sample situation. In order to improve the approximate 95 % confidence intervals N

rep

is selected to be as large as possible (limited by computing time, set to a maximum of 20 hours) which give approximately N

rep

= 250.

Theoretically, the minimum number of necessary bootstraps vary from situation to situation

(Davidson et al., 2000). Charnigo and colleagues have shown by a simulation study that 25

resamples can be enough for a large original sample size (Charnigo et al., 2010a). This aspect

should be studied further in order to improve inference in small sample situations. However,

in this thesis 250 replications will be made on the whole sample of 40 340 observations, thus

N

rep

= 250 and n = N is used.

(22)

2.5 The Gaussian mixture model for rounded data

One of the questions that will be explained in this study is how heaping and rounding can be adjusted for in the mixture context. In Section 1.1.2 the concepts of rounding, heaping and digit preference are clarified. A range of methods have been formulated to adjust for heaping, rounding and digit preference (see Appendix 1 for a literature overview). There are two especially interesting methods that have been formulated lately that deal with heaping, rounding and digit preference. The first one, the Composite Link Model (CLM) uses the Generalized Linear Model (GLM) framework to uncover the latent distribution of the heaped data. This is done by assuming a smooth latent distribution and modeling how the original data have been transferred to neighboring values using the Poisson distribution (Camarda et al., 2008; Eilers and Borgdorff, 2004; Thompson and Baker, 1981). The interesting aspect of this approach is the possibility to take the heaping mechanism into account in order to get a better fit to the model. However, this model gives an estimate for every possible value and for a large amount of possible values it might be computer intensive and thusly, more suitable for discret data.

The second method that has recently been developed for rounded and heaped data is for-

mulated by Bai and colleagues in a series of papers. They have developed a consistent and

asymptotically normally distributed estimate for rounded data in a time series setting (Bai et

al., 2009), for dependent data (Zhang et al., 2010) and for normal mixture models (Zhao and

Bai, 2012). Their work is founded on the work of Lee and Vardeman which has shown how to

construct asymptotically normal confidence intervals for rounded data (Lee and Vardeman,

2002). The work made by Bai and colleagues has focused on how to adjust for rounding in

existing methods. This way to adjust for heaping is a convenient alternative to compara-

tive methods (see Appendix 1). In a mixture setting Zhao and Bai (2012) have shown how

the EM-algorithm can be adjusted for rounding and heaping by defining the heaping interval

and incorporate this when estimating the mixture components. Since this method has proven

easy to incorporate when estimating mixtures and is less computer intensive compared with

the Composite Link Model (CLM) it will be used here. The method by Zhao and Bai (2012)

can be explained as follows.

(23)

Assume we want to estimate the parameters in a g-component mixture model, θ = (p

i

, µ

i

, σ

i

), using the Expectation-Maximization algorithm when we have heaped data (the EM-algorithm is described in Appendix 2). Let the heaping interval be denoted as h, the observed and heaped values be represented by y

j

for individual j with the probability density function g(y

j

| θ). Moreover, let the unobserved exact value for individual j be x

j

with the specified probability density function f (x

j

| θ). The relationship between the observed and rounded y

j

and the latent and exact x

j

is defined in Equation 24 according to Zhao and Bai (2012).

Y

j

= y

j

if y

j

− h

2 ≤ X

j

< y

j

+ h

2 (24)

where j = 1, . . . , n and n is the sample size.

In the case when y

j

is rounded on the interval h the probability of observing y

j

can be expressed as in Equation 25 (Zhao and Bai, 2012).

g(y

_j

| θ) =

1

yj+h/2 yj−h/2

f (x

_j

| θ) (25)

If x

_j

is assumed to come from a mixture of normal distributions as in Equation 26.

f (x

j

| θ) = f(x

^j

| µ

ⁱ

, σ

i

) = 1

"

2πσ

_i²

e

⁻

(yj−µi)2

2σ2i

(26)

where i = 1, . . . , g and g denotes the number of components in the mixture. The Log- Likelihood function for the rounded data can be expressed as Equation 27 (Zhao and Bai, 2012).

&

j

(θ) =

!

n j=1

log(g(y

j

, θ)) (27)

By using the definition of the probability density of the rounded data in Equation 25 the

Log-Likelihood in Equation 27 can be adjusted for rounding and heaping. The definition of

the density function in Equation 25 is then used in the EM-algorithm to estimate the mixture

(24)

parameters θ. The derivation of this estimation procedure is described in detail in Zhao and Bai (2012).

The method has been shown to work well to correct for rounding in data and give consistent and unbiased estimates when applied in the EM-algorithm for the mixture model (Zhao and Bai, 2012). Moreover, the following sections will describe the data and the results.

3 Data

Many projects have been made with the aim to improve data quality when databases exist or to establish systems for data collection and management when they do not exist. One such project is the Obstetric database at Muhimbili National Hospital in Dar es Salaam, Tanzania. The Obstetric database was founded in 1998 and was a co-operation project between Uppsala University, Ume˚ a University and Muhimbili University of Health and Allied Sciences (MUHAS). The goal was to establish a research database as a first step in the development of a birth register at Muhimbili National Hospital. Over the years the database has been developed and the need for new variables, methods of measurement and reporting has changed. The Obstetric database has been collecting data for over 15 years and during the period 1999-2013 a total of 174 064 deliveries were recorded in the register.

A comprehensive quality evaluation made by the author on selected variables related to the analysis on neonatal health (Appendix 4) showed that the birthweights were heaped, rounded to different extent, and that there are problems in the assessment of gestational age. One other problem is that the survival of the neonate after admission is not followed up. Because of this the analysis is limited to ’very early neonatal mortality’ that is, if the neonate survives the first few hours or days of life. Lacking procedures to collect follow up data on the survival of the neonate implies that the mortality rates are underestimated.

The observations that have been selected are singletons that have been classified as born alive

and admitted or discharged, fresh stillbirths and early neonatal deaths. Thus, macerated

stillbirths will not be used in the analysis since the fetus has died in the womb some time

(25)

before delivery. In the years before 2004 macerated and fresh stillbirths were both reported as stillbirths. During 2004 there are still some cases in this category and because of this the years after 2004 are selected. However, the database has been under construction and the assignments of diagnosis as well as automatic controls for the data imputation has changed over the years. Due to this it is reasonable to select later years. Due to the improved quality over time the years 2009 to 2013 are selected for the analysis, yielding 40 340 complete observations for the final dataset. A comparison of the neonatal mortality rates is also made with the whole dataset. In Figure 3 the final dataset is described by a bar chart.

According to the definition given in Section 1.1.2 heaping denotes a mixture of values that are heaped to a varying extent. In Figure 3 it is evident that the data suffers from heaping since most of the values are rounded to the 100-gram bin while a small fraction of the values are seemingly exact. The values that can be assumed to be ’non-rounded’ are shown as a small sub-distribution in the bottom of the distribution in Figure 3.

In Figure 4 the data has been divided into those values that are rounded to the 100-gram

bin, graph (a), and those who are exact, graph (b). Of course not all values that have a

value ending with two zeros are rounded, however, here it is approximated to be so. The

proportion of values that end with two zeros are 86.79 percent. Thus, the absolute majority

of the reported birthweights end with two zeros and the rounding interval will be assumed

to be 100 grams. That is, the value h in Equation 24, Section 2.5, is assumed to be 100.

(26)

050010001500200025003000

Birthweights 2009−2013

n=40 340 Birthweight (kg)

Count

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 5.5 6

Figure 3: Birthweights in the final dataset, 2009-2013, n=40 340

Figure 4: Heaping in the birthweight variable for data 2009-2013. (a) Distribution of the birthweights that are rounded to the 100-gram bin. (b) Distribution of the birthweights that are not rounded to the 100-gram bin.

(27)

3.1 Variables

The final dataset consists of babies born between the years 2009 to 2013, without those fetuses that are classified as macerated stillbirths. In Table 1 the selected variables are presented with year-specific proportions of each factor. All observations with missing values on any of the variables presented in Table 1 are excluded.

Year 2009 2010 2011 2012 2013 Average

Early neonatal mortality* 6.05 % 5.99 % 5.50 % 4.04 % 3.98 % 5.20%

Anemia 1.74 % 2.99 % 2.39 % 2.18 % 2.26 % 2.31 %

Malaria 0.11 % 0.10 % 0.03 % 0.04 % 0.00 % 0.06%

Diabetes 0.21 % 0.05 % 0.13 % 0.08 % 0.19 % 0.13 %

HIV 7.37 % 4.99 % 4.24 % 3.83 % 2.93 % 4.79 %

Hypertension 5.54 % 5.78 % 4.98 % 5.63 % 5.85 % 5.53 % Private patient 26.14 % 21.49 % 14.34 % 20.91 % 17.40 % 20.12 % Born after 35 weeks** 4.78 % 5.71 % 4.75 % 4.34 % 5.03 % 4.92 %

Births (n) 8953 8163 9123 7421 6680 40340

Table 1: Percent of early neonatal mortality and potential risk factors for every year. (*Early neonatal mortality exclude macerate stillbirths, ** after 35 weeks of completed gestation)

In Table 1 the percentage of early neonatal mortality (with macerated stillbirths excluded)

and other potential risk factors are described. The two risk factors that show the lowest

presence in the data are malaria and diabetes. The low rates might not be the result of

actual low rates but rather lacking procedures to test mothers for malaria and diabetes

before giving birth. Thus, malaria and diabetes are probably underestimated and due to

this they are not included in the analysis. A previous study conducted at the labor ward at

Muhimbili Hospital has reported a rate of 6.4% of 1 721 mothers tested for malaria during

2002-2003 (Kidanto et al., 2009). A reported number of 6.4% is considerably higher compared

with the percentages reported for each year in Table 1. Testing the mother for HIV is made

on a regular basis since mothers with HIV should be given special medical treatment in order

to prevent transmission of HIV to the fetus during labor.

(28)

A rough proxy for social status is if the mother is a private patient or not. Moreover, gestational age (the age of the fetus in weeks) is dichotomized into ’preterm’ here defined as a baby born before week 36. The definition of preterm is babies born before week 37 but due to mis-classifications of gestational age there is quite probable that many of the births recorded as born in week 36 are actually born in week 37 (Kidanto et al., 2009), see the quality evaluation in Appendix 4. One reason for this is that a medical examination of fundal height has been used to assess gestational age and at week 36 the fundus has reached its maximal height. Thus, weeks after the 36th are hard to assess using this method and probably many of those babies that have been classified as born in week 36 might actually be older. This is indicated by looking at the gestational-age specific birthweights, see Appendix 4. Detailed information on the birthweight variable, gestational age as well as missing values for some related variables are presented in Appendix 4.

Ethical clearance for using the data in order to perform a quality evaluation of the database

as well as conducting this analysis of birthweight-specific neonatal health was given by the

Research and Publications committee at Muhimbili University of Health and Allied Sciences

(MUHAS).

(29)

4 Results

The results of this study are presented in this section. First, the impact of adjusting for heaping is described. Secondly, the selected model is presented and the risk factors preterm, anaemia, private patient and HIV are depicted for each component using the Parametric Mixture of Logistic Regressions (PMLR). The analysis has been made using the statistical software R and Professor Richard Charnigo at University of Kentucky, U.S, has provided me with the original codes for the functions and graphs.

4.1 Comparison of adjusted and non-adjusted estimates and model selection

The following two subsections describe how the model selection criterion Flexible Information Criterion (FLIC), the estimated mixture parameters, their 95% confidence intervals, and mortality risk estimates are influenced when adjusting for heaping. The adjustment method developed by Zhao and Bai (2012) is described in Section 2.5, the heaping interval is assumed to be 100-grams.

4.1.1 Adjusting for heaping: FLIC to select the number of components

There are many criteria that have been developed to give a measure of the model-fit. The

FLIC-criterion is described in Section 2.4.1 and is the main tool for selecting the order of

the mixture model in the framework by Charnigo and colleagues. In the following Table 2

the Log-Likelihoods and the FLIC values are presented for the non-adjusted and adjusted

EM-algorithm and for models containing 2 to 7 components. The choice of a maximum 7

components is arbitrary but it is hard to theoretically justify the use of more components

according to previous epidemiological knowledge (Charnigo et al., 2010a). The model with

the lowest FLIC should be selected as the best model according to the framework by Charnigo

and colleagues (Charnigo et al., 2010a).

(30)

Non-adjusted Adjusted

Components: Log-Likelihood FLIC Log-Likelihood FLIC

2 -37834.78 75720.68 -37863.13 75778.15

3* -37785.95 75653.68* -37785.94 75654.91*

4 -37781.03 75674.50 -37781.14 75676.45

5 -37777.77 75698.66 -37781.31 75707.93

6 -37775.18 75724.15 -37790.24 75756.92

7 -37734.35 75673.15 -37795.47 75798.52

Table 2: Log-likelihoods and FLIC for models with 2-7 components and when adjusting for heaping

In Table 2 the FLIC is generally larger for the adjusted model. Hence, the FLIC might be underestimated when not adjusting for heaping.

For the adjusted model the Log-Likelihood does not monotonically increase for an increase in the number of components. The models with 4 and 5 components show a larger value of the Log-Likelihood compared with models with 6 or 7 components, see Table 2. This is problematic and should be studied further.

Moreover, for both the adjusted and non-adjusted cases a three component model is selected since the lowest value of FLIC can be found here. Next section describes the mixture param- eter estimates in a three-component model when the EM-algorithm is adjusted for heaping.

4.1.2 Adjusting for heaping: Parameter estimates

The following Table 3 shows a comparison of the estimates and 95% confidence intervals for a three-component mixture model with and without adjustment for heaping. The confidence intervals are based on bootstrap standard errors with 250 replications. In Equation 22, Section 2.4.2, we let C

₀

= 4 (recommended by Charnigo and colleagues ( 2010a)).

In Table 3 the mixture estimates of a three-component Gaussian mixture model are presented.

Let ˆ p

i

denote the mixture proportion for component i, ˆ µ

i

and ˆ σ

i

be the mean and standard

deviation that specify each normal mixture for component i = 1, 2, 3. The estimates in Table

(31)

Non-adjusted Adjusted

Component: Estimate 95 percent CI Estimate 95 percent CI

1 p ˆ

₁

0.08 (0.07; 0.09) 0.08 (0.07; 0.09)

ˆ

µ

1

1719.56 (1670.98; 1768.14) 1718.87 (1633.30; 1804.44) ˆ

σ

₁

491.20 (473.21; 509.20) 490.91 (458.77; 523.06)

2 p ˆ

2

0.90 ( 0.83; 0.97) 0.90 (0.81; 0.99)

ˆ

µ

₂

3104.49 (3096.45; 3112.53) 3104.59 (3092.47; 3116.70) ˆ

σ

2

499.80 (484.22; 515.38) 499.99 (475.52; 524.46)

3 p ˆ

₃

0.02 (-0.04; 0.08) 0.02 (-0.06; 0.10)

ˆ

µ

3

3878.72 (3069.46; 4687.98) 3897.27 ( 3069.23; 4725.30) ˆ

σ

₃

665.41 (533.47; 797.35) 661.58 (594.36; 728.79)

Table 3: Mixture parameter estimates for a three-component model when adjusting for heaping and not. With approximate 95 % confidence intervals based on bootstrap standard errors

3 are slightly different when adjusting for heaping. The change in the point estimates are small while the estimated 95% confidence intervals are wider for the adjusted parameter estimates. The wider confidence intervals imply higher standard errors when adjusting the EM-algorithm for heaping which in turn imply higher variability in the point estimates. This aspect is discussed further in the discussion part, Section 5.

Next section investigates how the differences observed in the adjusted and non-adjusted estimates (see Table 3) might influence the mortality risk estimates.

4.1.3 Adjusting for heaping: mortality curves and risk estimates

In the following Figure 5 graph (a) and (b) represent the estimated component-specific mor-

tality rates when adjusting for heaping. The vertical axis shows the mortality risk on a

logarithmic scale. In Figure 5 the two graphs show almost identical birthweight-specific mor-

tality curves, the dashed-lines. The main difference is that the approximate 95% confidence

intervals (solid lines) seem to vary in width for different birthweights. One such example can

be seen at around 2300 grams where the upper bound for component 3 crosses the mortality

(32)

curve for component 1 in graph (a). In Table 4 the point estimates of mortality risks are presented at 1000, 2000 and 3000 grams for component 1, 2000, 3000 and 4000 grams for component 2 and 2000, 3000 and 4000 grams for component 3. The point estimates are ex- pressed as risk of death per 1000 births and the confidence intervals are approximately 95%

(Charnigo et al., 2010b). Looking at Table 4, 6 out of the 9 estimated confidence intervals are wider for the estimates when the EM-algorithm is adjusted for heaping. This, even if the confidence bounds vary in size there seems to be a tendency for the confidence bounds for the adjusted estimates to be slightly wider.

Figure 5: Birthweight-specific risk of mortality (a) non-adjusted for heaping (b) adjusted for heaping. Solid lines are approximate 95 % confidence intervals. Resamples=250 and n=40 340.

Regarding the point estimates of mortality risk per 1000 births in Table 4 adjusting for heaping has practically no impact. However, the confidence intervals are slightly wider for some of the adjusted estimates. It should be kept in mind that the standard errors to the confidence intervals are drawn randomly and thus, there might be some variation due to randomness. These aspects can be studied further by using some other method than resampling to assess the standard deviation or by conducting a simulation study.

The issues with using bootstrap and resampling to assess standard errors will be further

discussed in the discussion part, Section 5. The next section will give a description of the final

three-component model when the EM-algorithm is adjusted for heaping using the method

(33)

Estimated mortality per 1000 births

Non-adjusted Adjusted

Component: Weight: Risk* 95% CI Risk* 95% CI

1 1000 g 484.40 (373.20; 597.30) 484.40 (379.80; 590.40) 2000 g 54.80 (17.30; 159.90) 54.80 (14.20; 189.40) 3000 g 3.30 (10.80; 34.30) 2.90 (10.80; 38.70) 2 2000 g 204.90 (123.60; 320.30) 204.90 (110.20; 349.10)

3000 g 31.00 (25.40; 37.60) 31.00 (25.60; 37.40) 4000 g 25.00 (11.20; 54.80) 25.00 (10.90; 56.20) 3 2000 g 8.30 (0.30; 170.30) 8.30 (0.40; 145.50) 3000 g 12.30 (2.70; 54.20) 12.30 (2.60; 55.20) 4000 g 17.10 (0.00; 899.50 ) 17.10 (0.00; 902.20)

Table 4: Birthweight-specific non-adjusted and adjusted estimates of mortality risk. Approximate 95 % confidence intervals.

Resamples=250 and n=40 340. (*Risk=logit⁻¹{ˆθ})

explained in Section 2.5 and with the defined heaping interval 100-grams.

4.2 Description of the final model using Mixture models and Para- metric Mixture of Logistic Regressions, PMLR

The following Figure 6, graph (a), shows the estimated three-component mixture model when heaping is adjusted for. The corresponding adjusted mixture estimates can be seen in Table 3. Let ˆ p

i

denote the mixture proportion for component i, ˆ µ

i

and ˆ σ

i

be the mean and standard deviation that specify each normal mixture for component i = 1, 2, 3.

In Figure 6 (a), the first component seems to represent the births in the fatter left tail well.

This component is estimated to contain 8 percent of the births while the main component

2 is estimated to contain 90 percent of all births. The third component is the smallest one

and is estimated to contain 2 percent of all births. However, the approximate 95 percentage

confidence bounds show that the proportion 0 lie within the interval and thus, it could be

(34)

the case that in the population there would be no third component. The component that has the largest estimated standard deviation is the third component , 661.58 grams 95% CI (594.36; 728.79). Both the first and second components have relatively similar estimated standard deviations on 491 grams 95% CI (458.77; 523.06) and 500 grams 95% CI (475.52;

524.46) respectively.

Estimates to a three-component Gaussian mixture model Component: Estimate 95 percent CI

1 p ˆ

₁

0.08 (0.07; 0.09)

ˆ

µ

1

1718.87 (1633.30; 1804.44) ˆ

σ

₁

490.91 (458.77; 523.06)

2 p ˆ

2

0.90 (0.81; 0.99)

ˆ

µ

₂

3104.59 (3092.47; 3116.70) ˆ

σ

2

499.99 (475.52; 524.46)

3 p ˆ

₃

0.02 (-0.06; 0.10)

ˆ

µ

3

3897.27 ( 3069.23; 4725.30) ˆ

σ

₃

661.58 (594.36; 728.79)

Table 5: Mixture parameter estimates for a three-component Gaussian mixture model when adjusting for heaping and not.

With approximate 95 % confidence intervals based on bootstrap standard errors. Using the adjusted EM-algorithm

The second graph (b) in Figure 6 depicts the estimated birth-weight specific mortality rates for the three components. A descriptive comparison between the components shows that the mortality curve for component 1 has a steeper slope compared with the mortality curve of the births estimated to belong to component 2. The difference in the slopes and that they appear to be somewhat parallel to each other indicates that the births in component 1 have a relatively lower of mortality for the same birthweights as the births that are estimated to belong to component 2.

In Figure 6 (b) the mortality rates are decreasing for higher birthweights in component 3.

One possible explanation for the decrease in mortality risk for high birthweight babies in

component 3 is that few babies have such a weight in the dataset for 2009-2013. Charnigo

and colleagues (2010b) have got similar results and explain it as an artificial decrease due to

(35)

Figure 6: A three-component Gaussian mixture model fitted to birthweights 2009-2013, n= 40 340. (a) Components, real data and mixture, (b) Estimated component-specific mortality curves and overall mortality. Using the adjusted EM-algorithm

small sample size. For the years 2009-2013 there are 321 babies having a weight of 4.5 kilos or more. Of these babies 13 died and of the 43 babies having a weight of 5 kilos or more only 3 experienced death.

Figure 7 depict the adjusted estimated mixture model and mortality curves for the complete dataset of 171 288 observations for the years 1999-2013. The problem with using the complete dataset is that macerated and fresh stillbirths can not be distinguished and thus, the mortality rate will be overestimated in relation to the estimates when the macerated stillbirths have been excluded (as for years 2009-2013). Moreover, this might be an explanation of the different directions of the mortality curves for the heavier babies in graph (b) Figure 7 and 6.

Birthweight-specific neonatal health: With application on data from a tertiaryhospital in Tanzania