Bayesian Model Selection with Intrinsic Bayes Factor for Location-Scale Model and Random Effects Model
Author:
Viktor Eriksson
(Date of Birth – 911001)
Semester: 4th semester in master’s program
Course name, level and credits: Independent Project II, Advanced, 15 Credits
Subject: Statistics
Supervisor: Olha Bodnar
Examiner: Hoang Nguyen
Abstract
In this project the concept of Bayesian model selection has been introduced for noninformative priors. In particular, the intrinsic Bayes factor (IBF) is used for the comparison of two models: the location-scale model and the random effects model. These two models are the most widely used in meta-analysis and for the adjustment of fundamental physical constants such as the Planck constant and the gravitational constant. The IBF, defined as a ratio of posterior predictive distributions, has been derived for both models. In the case of the location-scale model an analytical expression for the posterior predictive distribution is obtained, while in the case of the random effects model the posterior predictive distribution is presented as a one-dimensional integral only. This integral is computed numerically using Simpson's rule. The theoretical findings are applied to real-life data.
Contents
1 Introduction
  1.1 The background
  1.2 The models
  1.3 A Bayesian approach
  1.4 Bayesian model selection
2 Theory of Bayesian Model Selection
  2.1 The Bayes Factor
  2.2 The Intrinsic Bayes Factor
3 Objective Bayesian Analysis in Location-Scale and Random Effects Model
  3.1 The Location-Scale Model
  3.2 The Random Effects Model
  3.3 Location-Scale Model Vs. Random Effects Model
4 Numerical illustration
5 Empirical applications
  5.1 Newtonian constant of gravitation
6 Discussion
7 Appendices
  7.1 A: Frequentist statistics Vs. Bayesian statistics
  7.2 B: The necessity for a proper prior distribution
  7.3 C: Training samples in the IBF
  7.4 D: The numerical integration
  7.5 E: The simulation
  7.6 F: Symbols and clarifications
  7.7 G: Data on the Newtonian constant of gravitation
1 Introduction
In this introduction we start by explaining the background and purpose of the two models that we aim to select between. Afterwards we consider the models themselves, namely the location-scale model and the random effects model. Then we consider a Bayesian treatment of these two models and the objective Bayesian approach to model selection. We end by discussing the Bayes factor. The aim of the thesis is to employ a method for objective Bayesian model selection, namely the intrinsic Bayes factor. This method is then used to make a selection between the location-scale model and the random effects model. There exists no software for an automatic computation of the intrinsic Bayes factor. We code the computational algorithms ourselves in R (given in appendix G) and derive the needed mathematical closed forms in this thesis.
1.1 The background
In metrology and in clinical studies there exist parameters of such importance that a multitude of studies have been made to estimate them. For a physicist such an important parameter could be a natural constant, such as the Newtonian gravitational constant or the Planck constant (see [Mohr et al, 2016] for a compilation of multiple measurements of these physical constants). In medicine it could be a log odds ratio relating patients who suffer from carotid artery stenosis to patients who suffered a stroke (see [Meier et al, 2010] for a compilation of multiple such studies). These are all examples of important parameters for which multiple individual studies have been made for parameter estimation.
Each individual study reports an estimate of the parameter together with its uncertainty. In physics and medicine a parameter is estimated by taking measurements in a laboratory. Great care is taken in laboratories to determine the measurement margin of error such that it accounts for every identifiable source of error. We can therefore assume that an individual study estimate is internally consistent, i.e., its uncertainty takes into account every identifiable (hence explainable) source of uncertainty.
Although the individual studies are internally consistent, they often fail to be externally consistent. When the estimates together with their uncertainties from all individual studies are brought together in a scatterplot, it is usually revealed that the dispersion of the estimates is substantially greater than what the reported uncertainties suggest. The reported uncertainties take into account every identifiable source of error in each individual study, yet they fail to explain the full variation between the estimates satisfactorily. The extra variation is caused by external inconsistency, i.e., inconsistency between the laboratories, or interlaboratory inconsistency. We label this extra variation the heterogeneity. The heterogeneity could also be referred to as the excess variance or as the dark uncertainty [Thompson & Ellison, 2011]. It is important to note that the heterogeneity is unexplainable and cannot be resolved easily.
It is customary to combine the estimates from all individual studies into a pooled consensus value, because if we let each new study replace the estimate from the previous study, we would be wasting valuable data on the parameter. The undertaking to form a consensus value from individual studies is called an interlaboratory study. Meta-analysis is an often used method for this purpose. Its objective is to combine these estimates into a consensus value, here denoted by µ, representative of all individual studies taken together. The reported uncertainties of the individual studies are collected into a covariance matrix, here denoted by V. The diagonal of V contains the variances u_i^2 of the estimates, and the off-diagonal entries contain the covariances ρ_ij u_i u_j, where ρ_ij is the correlation between the ith and the jth study (see appendix F for an explanation of the various symbols used in this thesis). In a meta-analysis it is important that the heterogeneity is accounted for when forming the consensus value, or else the consensus value will be biased. But how can an unexplained heterogeneity be accounted for?
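As a small illustration of this construction, the following sketch (Python with NumPy; the uncertainties u_i and correlations ρ_ij are hypothetical numbers, not taken from any study) assembles the covariance matrix V from reported uncertainties and correlations:

```python
import numpy as np

# Hypothetical reported standard uncertainties u_i of three studies
u = np.array([0.10, 0.15, 0.08])

# Hypothetical correlations rho_ij between the studies (unit diagonal)
R = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])

# V has the variances u_i^2 on the diagonal and the covariances
# rho_ij * u_i * u_j on the off-diagonal
V = R * np.outer(u, u)
```

For uncorrelated studies (ρ_ij = 0) the matrix V reduces to the diagonal matrix diag(u_1^2, ..., u_n^2).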
The ending question of the last paragraph leads us to the two models of interest in this thesis, namely the location-scale model and the random effects model. Both of these models are used in a meta-analysis to form a consensus value while accounting for an unexplainable heterogeneity. In both models we let the heterogeneity be denoted by σ, although we index this parameter according to the model: σ_LS for the location-scale model and σ_RE for the random effects model.
1.2 The models
Under the location-scale model the data generating process is described by X = µ1 + ε_LS, where ε_LS ∼ N(0, σ_LS^2 V), and 1 and 0 are column vectors of ones and zeros, respectively. This model is generally favored by physicists in parameter estimation [Koepke et al, 2017, p.40]. It is used to adjust for inconsistent data in [Weise & Wöger, 2000] and in [Mohr et al, 2008], and is related to the Birge ratio, see [Birge, 1932].
Under the random effects model the data generating process is described by X = µ1 + λ_RE + ε_RE, where λ_RE ∼ N(0, σ_RE^2 I) and ε_RE ∼ N(0, V). This model is generally favored by medical researchers in parameter estimation [Koepke et al, 2017, p.40]. It has been used in chemistry for interlaboratory comparisons in [Mandel & Paule, 1970] and in [Toman et al, 2012]. Other references on the application of the random effects model can be found in [Bodnar et al, 2016, p.47]. In [Bodnar, Link & Elster, 2016] an objective Bayesian inference is proposed for the generalized random effects model; the reference prior distribution and the posterior distribution of µ are derived and then used to estimate the Planck constant.
In [Bodnar et al, 2016] both the location-scale model and the random effects model have been tested for robustness to model misspecification by computing coverage intervals. The test computes the Bayesian credible interval for the parameter µ by employing a simulated random sample. The experiment generates a sample and then computes the credible interval under the respective model; this is repeated 50000 times, and the coverage probability is defined as the success rate at which the credible interval covers µ. The results show that the random effects model is more robust to model misspecification than the location-scale model: when the location-scale model is true, but the random effects model is used to estimate µ, the coverage probability is roughly 95%, which is the same coverage probability as when the random effects model itself is true. However, when the random effects model is true, but the location-scale model is used to estimate µ, the coverage probability is only roughly 45%, while when the location-scale model is true the coverage probability is 95%. If the location-scale model is actually true, then the random effects model will still do a good job at estimating µ. On the other hand, if the random effects model is actually true, then the location-scale model will make a bad estimate of µ. Note however that both models have good coverage probabilities (≈ 95%) given that the used model is the true model.
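The flavor of such a coverage experiment can be mimicked on a much smaller scale. The sketch below is a simplified frequentist stand-in for the Bayesian intervals of [Bodnar et al, 2016] (all numbers are hypothetical): data are generated from the random effects model, but the interval for µ is formed as if V alone described the dispersion, which illustrates the undercoverage discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_re = 0.0, 0.5          # hypothetical true values
u = np.full(5, 0.1)                   # hypothetical reported uncertainties
V = np.diag(u ** 2)
w = np.linalg.solve(V, np.ones(5))    # weights V^{-1} 1
half = 1.96 / np.sqrt(w.sum())        # naive 95% half-width ignoring sigma_re

hits, n_rep = 0, 2000
for _ in range(n_rep):
    # data from the random effects model: X = mu 1 + lambda + eps
    x = mu_true + rng.normal(0, sigma_re, 5) + rng.normal(0, u)
    mu_hat = w @ x / w.sum()          # weighted mean that ignores sigma_re
    hits += abs(mu_hat - mu_true) <= half
coverage = hits / n_rep               # far below the nominal 0.95
```

With σ_RE much larger than the reported u_i, the empirical coverage falls far short of the nominal 95%, mirroring the misspecification effect reported in the cited study.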
1.3 A Bayesian approach
The location-scale model and the random effects model are treated within a Bayesian framework, and therefore the unknown parameters of the models, (µ, σ_LS) and (µ, σ_RE), are assumed to be random variables described by a probability density function (see appendix A for a brief discussion of the differences between frequentist statistics and Bayesian statistics). Both models have in common the known positive definite matrix V (which is constant) and the unknown parameter µ. The parameter µ is the target parameter, while the parameters σ_LS and σ_RE are nuisance parameters representing heterogeneity for the respective model.
Consider the frequentist viewpoint of parameter estimation and compare it to the Bayesian viewpoint. Their main differences regard the interpretation of the parameter µ and the uncertainty of the estimate of µ [Koepke et al, 2017, p.44]. In the frequentist perspective, the parameter µ is an unknown constant. The frequentist evaluates the estimate of µ within a repeatability context: under imagined repetitions of the experiment in which data are collected and the estimate of µ is computed, the frequentist evaluates how the estimate of µ fluctuates from one sample to another. Hence the frequentist viewpoint of uncertainty is based on the (objective) sampling variability. In the Bayesian perspective µ is unknown, but we may formulate our prior knowledge of µ in a prior probability density function for µ. Then µ is viewed as a random variable, and the prior distribution can be used together with the observed data to infer the value of µ. The Bayesian views data as a means to update the prior distribution into a posterior distribution with a reduced dispersion of plausible outcomes for µ. The variation in µ is based on the prior knowledge and the observed data, not on repeatability. If the prior distribution of µ is subjective, then the estimate and uncertainty of µ in the posterior distribution are partly subjective as well. The Bayesian approach to parameter estimation can however be made objective with a noninformative prior distribution.
1.4 Bayesian model selection
There are various methods for objective Bayesian model selection; however, each method usually has some limitation or difficulty, because it either requires that certain conditions are satisfied for it to work well or is difficult to implement in general. For example, the Bayes factor by the conventional prior approach can be difficult to implement, and there is no general method to derive the conventional priors [Berger & Pericchi, 2001, p.148]. The Bayesian information criterion (BIC) can be used as an approximation to the Bayes factor for large sample sizes; however, this approximation is beset with difficulties for models with irregular asymptotics [Berger & Pericchi, 2001, p.153].
The intrinsic Bayes factor is our method of choice in this project, and we properly present it in section 2.2. The benefit of the intrinsic Bayes factor is that it is automatic (objective) in the sense that it depends only on the observed data and a noninformative prior distribution. It may however be computationally intensive and unstable [Berger & Pericchi, 1996, p.120-121]. The intrinsic Bayes factor makes use of a minimal training sample for parameter estimation. To obtain a stable result it is conventional to take the average or median of the intrinsic Bayes factors computed over all possible training samples. The fractional Bayes factor offers an alternative way of Bayesian model selection when noninformative priors are employed (see [O'Hagan, 1995, p.115]).
So why use a Bayesian approach to model selection? Some good arguments are stated in [Berger & Pericchi, 2001, p.138-140] and are as follows. The Bayes factor is interpreted as an odds factor, which is easy to understand. It is an automatic Occam's razor, since it favors a simple model over a complex model. The Bayes factor is a consistent method, which means that if any of the compared models is true, then the true model is selected (given that some mild conditions are satisfied). The Bayes factor does not require models to be nested (one being a special case of the other), which makes it possible to compare different types of models in which the parameters may vary in dimension. Some of the difficulties with the Bayes factor are also stated in [Berger & Pericchi, 2001, p.142]. For example, the Bayes factor can be difficult to compute. The Bayes factor returns indeterminate answers when we have different dimensions of the parameter spaces in conjunction with improper prior distributions. The Bayes factor does not usually return good answers when we use vague proper priors; we should instead use improper noninformative priors. The same parameter can have different interpretations in different models, and so the prior distribution for this parameter must be specified differently in different models. This may be counterintuitive when we are formulating subjective priors.
The influence of the prior distribution on the Bayes factor is greater than its influence on the estimation of a parameter. This is because the weight of the prior distribution in the Bayes factor is not reduced with an increasing sample size [Berger & Pericchi, 1996]. The Bayes factor is therefore particularly vulnerable to misspecifications of the prior distribution, and for this reason we look for an objective noninformative prior distribution. [Berger, Bernardo & Sun, 2009] developed the so-called Berger and Bernardo reference prior, which is an objective noninformative prior distribution and will be used in this thesis. The Berger and Bernardo reference prior is recommended for use in the intrinsic Bayes factor for large sample sizes [Berger & Pericchi, 1996, p.121].
2 Theory of Bayesian Model Selection
2.1 The Bayes Factor
In this subsection we introduce the Bayes factor, its interpretation, and its use for model comparison. Let M_1, ..., M_i, ..., M_q denote the models that we aim to compare. Under model M_i the data X follow the probability density function f_i(x|θ_i) with θ_i ∈ Θ_i, which we also refer to as the likelihood function. In the Bayesian approach to model selection we start by assigning to each model M_i a prior probability P(M_i) of the model M_i being correct and a prior distribution π_i(θ_i) of the parameter θ_i. Then the posterior probability P(M_i|x) of the model M_i being correct when the sample x is observed is given by

\[ P(M_i|x) = \left( \sum_{j=1}^{q} \frac{P(M_j)}{P(M_i)}\, B_{ji}(x) \right)^{-1}, \quad \text{where} \quad B_{ji}(x) = \frac{m_j(x)}{m_i(x)} = \frac{\int f_j(x|\theta_j)\,\pi_j(\theta_j)\,d\theta_j}{\int f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i} \]

is the Bayes factor between M_j and M_i.
The posterior probability P(M_i|x) is however dependent on the subjectively chosen prior probabilities P(M_i), i = 1, 2, ..., q, of the model M_i being correct. If one prefers the Bayesian model selection procedure to be objective, then one shall assign equal prior probabilities to the considered models, that is P(M_1) = ... = P(M_q) = 1/q. In this case the posterior probability P(M_i|x) used for the selection of a model depends on the Bayes factors B_ji(x) only. In the special case q = 2, the Bayes factor alone can be used to make an objective Bayesian model selection.
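Under equal prior probabilities the posterior model probabilities reduce to normalized marginal likelihoods, since B_ji = m_j(x)/m_i(x). A minimal sketch of this reduction (the marginal likelihood values below are hypothetical):

```python
import numpy as np

def posterior_model_probs(marginals):
    """Posterior model probabilities P(M_i|x) under equal prior
    probabilities P(M_i) = 1/q.  Since B_ji = m_j/m_i, the formula
    P(M_i|x) = (sum_j B_ji)^(-1) reduces to m_i / sum_j m_j."""
    m = np.asarray(marginals, dtype=float)
    return m / m.sum()

# Hypothetical marginal likelihoods m_1(x) and m_2(x), so B_12 = 3
probs = posterior_model_probs([0.03, 0.01])
```

Here B_12 = 3, so P(M_1|x) = 3/(3 + 1) = 0.75 and P(M_2|x) = 0.25.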
The Bayes factor is interpreted as the odds in favor of M_j against M_i based on the evidence in the data x. The inequality B_ji > 1 is supportive of model M_j, while B_ji < 1 is interpreted as evidence against M_j. Bayesian model selection by the Bayes factor can be understood as a generalization of Bayesian hypothesis testing. We should note that the prior distributions π_i(θ_i) of the parameters θ_i are subjective because they are decided prior to the data, but they are objective if we let them be noninformative.
The marginal distribution of the data under model M_i is given by

\[ m_i(x) = \int_{\theta_i \in \Theta_i} f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i . \]
It measures the probability of observing x under model M_i. As a rule of thumb, we can say that a high probability of observing x under model M_i corresponds to a good fit of the model to the data. This rule can easily be abused, however, since for example a model can be introduced to overfit the data. The overfitted model then has a very high value of m_i(x), yet it predicts new values with a high uncertainty. We remark that only models which can be supported by the underlying theory of the application should be compared, and not arbitrary functional forms chosen for the sake of maximizing m_i(x).
The Bayes factor can also be interpreted as an odds factor if the marginal distribution m_i(x) is proper, such that ∫_{x∈Ω} m_i(x) dx = 1 for all i = 1, ..., q. This requirement may seem strange, given that m_i(x) as a probability density function should always integrate to unity. However, this may not be the case when an improper prior distribution, denoted by π_i^N(θ_i), is used (see appendix B for a detailed explanation). Such a situation is usually present when the Bayes factor is computed by using a noninformative prior, i.e., when objective Bayesian inference is preferable, since most noninformative prior distributions are improper. An improper prior distribution π_i^N(θ_i) cannot have a normalizing constant and therefore it is not a true probability density function. If we multiply π_i^N(θ_i) by an unspecified constant c_i and determine c_i by normalizing π_i^N(θ_i), then we get

\[ \int_{\theta_i \in \Theta_i} c_i\, \pi_i^N(\theta_i)\,d\theta_i = 1 \iff c_i = \frac{1}{\int_{\theta_i \in \Theta_i} \pi_i^N(\theta_i)\,d\theta_i} = \frac{1}{\infty} = 0 . \]
The result c_i = 0 means from a practical point of view that there exists no normalizing constant for an improper prior distribution π_i^N(θ_i). Since an improper prior distribution does not possess a normalizing constant, it is only meaningful to know the distribution up to a constant of proportionality. This means that the improper prior distribution can be written as π_i^N(θ_i) ∝ h_i(θ_i). One could however include some proportionality constant c_i and write π_i^N(θ_i) = c_i h_i(θ_i), although c_i does not normalize π_i^N(θ_i).
Consider computing the Bayes factor for two models M_j and M_i where j ≠ i. Model M_j has a likelihood function f_j(x|θ_j) and an improper prior distribution π_j^N(θ_j) = c_j h_j(θ_j). Model M_i has a likelihood function f_i(x|θ_i) and a proper prior distribution π_i(θ_i). Then the Bayes factor is given by

\[ B_{ji}(x) = c_j\, \frac{\int f_j(x|\theta_j)\, h_j(\theta_j)\,d\theta_j}{\int f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i} , \]

and it depends on an arbitrarily chosen value of c_j. This is clearly a problem, since any value of c_j can be picked by the statistician, which would greatly alter the support of a model given by the data x. If model M_i also has an improper prior distribution π_i^N(θ_i) = c_i h_i(θ_i), then we have

\[ B_{ji}(x) = \frac{c_j}{c_i}\, \frac{\int f_j(x|\theta_j)\, h_j(\theta_j)\,d\theta_j}{\int f_i(x|\theta_i)\, h_i(\theta_i)\,d\theta_i} , \]
which depends on the ratio of two arbitrarily chosen constants. This is a critical disadvantage of the application of the Bayes factor to model selection, as the following citation from a textbook on Bayesian data analysis shows: "This fully Bayesian approach [Bayes factors] has some appeal but we generally do not recommend it because, in practice, the marginal likelihood is highly sensitive to aspects of the model that are typically assigned arbitrarily and are untestable from data." [Gelman et al, 2014, p. 182].
2.2 The Intrinsic Bayes Factor
The intrinsic Bayes factor (IBF) is a solution to the problem that improper prior distributions impose on the Bayes factor, namely that under an improper prior distribution the Bayes factor has no interpretation as an odds factor. The IBF differs from the ordinary Bayes factor in that it makes use of a training sample. The training sample is employed for the purpose of transforming the improper prior distribution into a proper posterior distribution, which is then used as a prior for the remaining elements of the sample. The IBF is then interpretable as an odds factor.
The training sample x_ℓ is a subset of x, x_ℓ ⊂ x, of size m. Let x_(ℓ) = x − x_ℓ be the data set with the training sample removed. Then x_ℓ and x_(ℓ) form a partition of x. We have the following definition:
Definition 1. [Berger & Pericchi, 2001, p.149] "A training sample, x_ℓ, is called proper if 0 < m_i^N(x_ℓ) < ∞ for all M_i, and minimal if it is proper and no subset is proper", where m_i^N(x_ℓ) denotes the marginal distribution of x_ℓ computed under the noninformative prior π_i^N(θ_i).
Why does a training sample make an improper prior distribution proper? An explanation is that the training sample enables the improper prior distribution to better describe which values of θ_i are the more probable. Then the cumulative distribution function (CDF) of π_i^N(θ_i|x_ℓ) converges for all θ_i ∈ Θ_i and the posterior distribution π_i^N(θ_i|x_ℓ) is proper.
To understand this, recall that the impropriety of a prior distribution follows from it being noninformative. When we use a noninformative prior distribution we admit minimal knowledge of the true value of θ_i. For this reason such priors are made to be vague or 'flat' in distribution. But then the probability mass becomes too evenly distributed across the range of θ_i for the cumulative distribution function of π_i^N(θ_i) to converge for all θ_i ∈ Θ_i (see appendix C for examples and further explanation). Oppositely, as the information about θ_i increases, the tails of the distribution diminish in size. When we have more information about θ_i, the probability mass accumulates in the vicinity of the true value of θ_i. The distribution becomes more mound-shaped rather than flat with increasing information about θ_i. With small enough tails in the distribution, the CDF of π_i^N(θ_i|x_ℓ) will converge for all θ_i ∈ Θ_i.
The purpose of a noninformative prior distribution is for the Bayesian approach to model selection to become objective. If we need the prior distribution to be both informative and objective, we have to use observed data for parameter estimation, replacing π_i^N(θ_i) by π_i^N(θ_i|x_ℓ). But then we lose data points to the specification of π_i^N(θ_i|x_ℓ) that could have been used for computing the Bayes factor. We are sacrificing a portion of the data set x, namely the training sample x_ℓ, for parameter estimation instead of using it for model selection. We want as much data as possible for the computation of the Bayes factor, since the overall goal is to make a comparison between two models. This is why we only want the minimal training sample.
The minimal training sample is usually of the same size as the number of parameters in the model. But there are exceptions to this rule; if for example the prior distribution is proper for some parameters, then the minimal training sample size may be less than the number of parameters in the model [Berger & Pericchi, 1996]. As an example, suppose we have a joint prior distribution of two parameters that is improper, π(µ, σ^2) ∝ 1/σ^2; then two observations from the data set should suffice as a training sample. Which two observations should we pick for a training sample? There are n(n − 1)/2 ways to select two observations from x. We should use every combination to compute the IBF, which implies that we must make n(n − 1)/2 computations of the IBF, one for each such combination. We then draw conclusions based on a summary of the differently computed IBFs, such as the average IBF or the median IBF.
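The loop over all n(n − 1)/2 pairs and the summarizing step can be sketched generically. In the sketch below the IBF itself is passed in as a function; the dummy `ibf_fn` used at the end is a placeholder to exercise the enumeration, not the location-scale or random effects IBF derived later in this thesis.

```python
import numpy as np
from itertools import combinations

def summarize_ibf(x, ibf_fn):
    """Evaluate ibf_fn(x_rest, x_l) for every minimal training sample
    x_l = {x_i, x_j} and return the median and mean over all pairs."""
    n = len(x)
    vals = []
    for i, j in combinations(range(n), 2):   # all n(n-1)/2 pairs
        keep = np.ones(n, dtype=bool)
        keep[[i, j]] = False                  # x_(l): data minus the pair
        vals.append(ibf_fn(x[keep], x[[i, j]]))
    return np.median(vals), np.mean(vals), len(vals)

x = np.arange(6.0)
# placeholder "IBF" just to exercise the enumeration over pairs
med, avg, n_pairs = summarize_ibf(x, lambda x_rest, x_l: x_l.sum())
```

With n = 6 the function visits 6 · 5 / 2 = 15 training samples; plugging in the true IBF for a model pair yields the average and median IBFs discussed above.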
Let the IBF be denoted by B_ji^I(x_(ℓ)|x_ℓ), since it depends on x_(ℓ) and is conditioned on the training sample x_ℓ.

Definition 2. The intrinsic Bayes factor is

\[ B_{ji}^I(x_{(\ell)}|x_\ell) = \frac{m_j^N(x_{(\ell)}|x_\ell)}{m_i^N(x_{(\ell)}|x_\ell)} = \frac{\int f_j(x_{(\ell)}|\theta_j, x_\ell)\,\pi_j^N(\theta_j|x_\ell)\,d\theta_j}{\int f_i(x_{(\ell)}|\theta_i, x_\ell)\,\pi_i^N(\theta_i|x_\ell)\,d\theta_i} . \]
There is an easier way to compute the IBF than using the definition. We may instead develop the IBF as follows:

\[ B_{ji}^I(x_{(\ell)}|x_\ell) = \frac{m_j^N(x_{(\ell)}|x_\ell)}{m_i^N(x_{(\ell)}|x_\ell)} = \frac{m_j^N(x_{(\ell)}, x_\ell)/m_j^N(x_\ell)}{m_i^N(x_{(\ell)}, x_\ell)/m_i^N(x_\ell)} = \frac{m_j^N(x)/m_j^N(x_\ell)}{m_i^N(x)/m_i^N(x_\ell)} = \frac{m_j^N(x)}{m_i^N(x)} \times \frac{m_i^N(x_\ell)}{m_j^N(x_\ell)} = B_{ji}^N(x) \times B_{ij}^N(x_\ell) . \]
Recall that the original problem with Bayes factors was that of improper prior distributions. We used two models M_j and M_i that have likelihood functions f_j(x|θ_j) and f_i(x|θ_i), respectively, with improper prior distributions π_j^N(θ_j) = c_j h_j(θ_j) and π_i^N(θ_i) = c_i h_i(θ_i). When computing the IBF of M_j and M_i we have

\[ B_{ji}^I(x_{(\ell)}|x_\ell) = B_{ji}^N(x) \times B_{ij}^N(x_\ell) = \frac{c_j}{c_i}\, \frac{\int f_j(x|\theta_j)\, h_j(\theta_j)\,d\theta_j}{\int f_i(x|\theta_i)\, h_i(\theta_i)\,d\theta_i} \times \frac{c_i}{c_j}\, \frac{\int f_i(x_\ell|\theta_i)\, h_i(\theta_i)\,d\theta_i}{\int f_j(x_\ell|\theta_j)\, h_j(\theta_j)\,d\theta_j} . \]
The important point is that the arbitrary constants of proportionality cj and ci will cancel in the IBF [O’Hagan, 1995]. Hence the IBF does not
depend on anything arbitrary or subjective and can therefore be used for an objective Bayesian model selection.
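The cancellation can be checked numerically in a toy comparison that is not one of the models of this thesis: let M_1 fix X_k ∼ N(0, 1) with no free parameter, and let M_2 have X_k ∼ N(θ, 1) with the improper prior π^N(θ) = c. One observation is then a minimal training sample for M_2, and the IBF is unchanged when c is varied by many orders of magnitude:

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=5)   # hypothetical data

def m1(data):
    """Marginal under M1: X_k ~ N(0,1), no free parameter."""
    return np.prod(stats.norm.pdf(data))

def m2(data, c):
    """Marginal under M2: X_k ~ N(theta,1), improper prior pi(theta) = c."""
    f = lambda t: c * np.prod(stats.norm.pdf(data - t))
    val, _ = integrate.quad(f, -50.0, 50.0)
    return val

def ibf_21(data, train_idx, c):
    """B^I_21 = B^N_21(x) * B^N_12(x_l); the constant c cancels."""
    x_l = data[list(train_idx)]
    return (m2(data, c) / m1(data)) * (m1(x_l) / m2(x_l, c))

a = ibf_21(x, [0], 1.0)
b = ibf_21(x, [0], 1e6)   # a million-fold change in c
```

The two evaluations agree to numerical precision, illustrating that the IBF is free of the unspecified constant.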
We summarize this subsection by saying that a noninformative prior distribution is used for an objective approach to Bayesian model selection. The IBF is a method in which the noninformative prior distribution is made proper by the use of a minimal training sample. The improper π_i^N(θ_i) is then replaced by the proper π_i^N(θ_i|x_ℓ), and the rest of the data x_(ℓ) is employed for model selection, resulting in the IBF. The IBF measures the odds in favor of M_j against M_i based on the evidence in the data, without the arbitrariness of the unspecified constants of improper priors.
3 Objective Bayesian Analysis in Location-Scale and Random Effects Model
In this section we establish all theoretical results that are needed to compute the IBF for the location-scale model and the random effects model. We start by presenting the location-scale model by a definition, mention its use in statistics, and then formulate the theorems needed for the final calculation of the IBF. We treat the random effects model likewise and end this section by comparing the two models and mentioning some of the literature on the subject.
3.1 The Location-Scale Model
The location-scale model is related to the Birge method, a popular method in physics and metrology for adjusting uncertainties to adapt for external inconsistency. The Birge method is used under the assumption that, due to external inconsistency, i.e., heterogeneity, the reported uncertainty of each estimate is understated. The Birge method multiplies each uncertainty by a common factor, called the Birge ratio, which amounts to a sort of averaging of the uncertainties. Under the Birge ratio all the uncertainties are adjusted to be uniformly understated and hence mutually consistent. In [Bodnar & Elster, 2014] the Birge ratio is studied within a Bayesian framework in relation to the location-scale model.
Definition 3. Let V be a known positive definite matrix, µ an unknown location parameter and σ_LS an unknown scale parameter. Then the location-scale model M_LS of the data X is described by (i) and (ii).

(i) X = µ1 + ε_LS.

(ii) ε_LS ∼ N(0, σ_LS^2 V).
The components (i) and (ii) of the location-scale model can be combined to form X ∼ N(µ1, σ_LS^2 V). This means that X follows a multivariate normal distribution with mean µ1 and covariance matrix σ_LS^2 V. The positive definite matrix V contains all the reported uncertainties of the parameter µ, while σ_LS is an unknown factor that represents the unexplainable heterogeneity. The parameter µ is the target parameter in the context of an application, while σ_LS is a nuisance parameter included in the model to capture the heterogeneity. The idea is that the elements of V can be understated due to the presence of heterogeneity. By multiplying V by σ_LS^2, the elements of V become uniformly adjusted for the heterogeneity. By using the notation X|µ, σ_LS, M_LS we make it explicit that the data are conditioned on the two parameters µ and σ_LS under the location-scale model.
Definition 4. Under the location-scale model the data X follow the multivariate normal distribution N(µ1, σ_LS^2 V) with probability density function given by

\[ f(x|\mu, \sigma_{LS}, M_{LS}) = \frac{\det(V)^{-1/2}\,\sigma_{LS}^{-n}}{(2\pi)^{n/2}} \exp\left( -\frac{1}{2\sigma_{LS}^2}\, (x - \mu 1)^T V^{-1} (x - \mu 1) \right) . \]
For a full Bayesian treatment of the location-scale model we need the joint probability density function f(x, µ, σ_LS|M_LS) = f(x|µ, σ_LS, M_LS) π^N(µ, σ_LS), where π^N(µ, σ_LS) is the noninformative reference prior of the location-scale model. The Berger and Bernardo reference prior has been derived for the general location-scale model in [Fernández & Steel, 1999] and is given by

\[ \pi^N(\mu, \sigma_{LS}) \propto \frac{1}{\sigma_{LS}} . \]
In the derivation of the marginal distribution of X under the location-scale model we will use the following two lemmas.
Lemma 1. Let V be positive definite. Then it follows that

\[ (x - \mu 1)^T V^{-1} (x - \mu 1) = x^T Q x + 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 , \quad \text{where} \quad Q = V^{-1} - \frac{V^{-1} 1 1^T V^{-1}}{1^T V^{-1} 1} . \]
Proof. We first expand the left-hand side of the equation. It holds that

\[ (x - \mu 1)^T V^{-1} (x - \mu 1) = (x^T - \mu 1^T)(V^{-1} x - \mu V^{-1} 1) = x^T V^{-1} x - 2\mu\, 1^T V^{-1} x + \mu^2\, 1^T V^{-1} 1 . \]

Completing the square for µ leads to

\[ 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 + x^T V^{-1} x - \frac{(1^T V^{-1} x)^2}{1^T V^{-1} 1} . \]

Notice that (1^T V^{-1} x)^2 = x^T V^{-1} 1 1^T V^{-1} x, because 1^T V^{-1} x is a scalar and is therefore equal to its own transpose. Hence,

\[ 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 + x^T V^{-1} x - \frac{x^T V^{-1} 1 1^T V^{-1} x}{1^T V^{-1} 1} = 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 + x^T Q x . \]

The proof is complete.
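Lemma 1 is easy to sanity-check numerically; the sketch below evaluates both sides of the identity for a randomly generated positive definite V, data vector x and value of µ (all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)        # a positive definite matrix
Vi = np.linalg.inv(V)
x = rng.normal(size=n)
one = np.ones(n)
mu = 0.7

# Q = V^{-1} - V^{-1} 1 1^T V^{-1} / (1^T V^{-1} 1)
Q = Vi - np.outer(Vi @ one, one @ Vi) / (one @ Vi @ one)

lhs = (x - mu * one) @ Vi @ (x - mu * one)
center = (one @ Vi @ x) / (one @ Vi @ one)
rhs = x @ Q @ x + (one @ Vi @ one) * (mu - center) ** 2
```

The two sides agree to floating-point precision for any choice of µ.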
Lemma 2. Let n > 1 and let V be positive definite. Then it follows that

\[ \int_{-\infty}^{\infty} \left[ (x - \mu 1)^T V^{-1} (x - \mu 1) \right]^{-n/2} d\mu = (x^T Q x)^{-n/2}\, \frac{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{(n-1)\pi}\, \sqrt{\frac{1}{n-1}\,\frac{x^T Q x}{1^T V^{-1} 1}}}{\Gamma\!\left(\frac{n}{2}\right)} . \]
Proof. By Lemma 1 we may rewrite the integrand as follows:

\[ \left[ (x - \mu 1)^T V^{-1} (x - \mu 1) \right]^{-n/2} = \left[ x^T Q x + 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 \right]^{-n/2} = (x^T Q x)^{-n/2} \left[ 1 + \frac{1}{n-1} \left( \frac{\mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1}}{\sqrt{\frac{1}{n-1}\,\frac{x^T Q x}{1^T V^{-1} 1}}} \right)^{2} \right]^{-\frac{(n-1)+1}{2}} . \]

The integrand is proportional to the probability density function of a scaled and shifted t-distribution with n − 1 degrees of freedom, location parameter 1^T V^{-1} x / 1^T V^{-1} 1 and scale parameter √((1/(n−1)) x^T Q x / 1^T V^{-1} 1). Using the fact that the definite integral of a density function over its support equals unity, the statement of Lemma 2 follows.
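Lemma 2 can likewise be verified by comparing the closed form with a numerical quadrature of the left-hand side (the V and x below are hypothetical):

```python
import numpy as np
from scipy import integrate
from scipy.special import gamma

rng = np.random.default_rng(3)
n = 3
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)        # positive definite
Vi = np.linalg.inv(V)
x = rng.normal(size=n)
one = np.ones(n)
Q = Vi - np.outer(Vi @ one, one @ Vi) / (one @ Vi @ one)
xQx, b = x @ Q @ x, one @ Vi @ one   # b = 1^T V^{-1} 1

# left-hand side of Lemma 2, integrated numerically over mu
f = lambda mu: ((x - mu * one) @ Vi @ (x - mu * one)) ** (-n / 2)
numeric, _ = integrate.quad(f, -np.inf, np.inf)

# right-hand side of Lemma 2
closed = (xQx ** (-n / 2) * gamma((n - 1) / 2) * np.sqrt((n - 1) * np.pi)
          * np.sqrt(xQx / ((n - 1) * b)) / gamma(n / 2))
```

The quadrature result matches the closed form to high precision.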
Theorem 1. Let n > 1 and let V be positive definite. Then under the location-scale model and reference prior π^N(µ, σ_LS) = 1/σ_LS, the marginal distribution of X is given by

\[ m(x|M_{LS}) = \frac{\Gamma\!\left(\frac{n-1}{2}\right) \left( x^T Q x \right)^{-\frac{n-1}{2}}}{\det(V)^{1/2}\, 2\, \pi^{\frac{n-1}{2}}\, \sqrt{1^T V^{-1} 1}} . \]
Proof. The marginal distribution of the data x is obtained by integrating out the parameters µ and σ_LS from the joint probability density function f(x, µ, σ_LS|M_LS). It holds that

\[ m(x|M_{LS}) = \int_0^\infty \int_{-\infty}^{\infty} f(x|\mu, \sigma_{LS}, M_{LS})\, \pi^N(\mu, \sigma_{LS})\, d\mu\, d\sigma_{LS} = \int_0^\infty \int_{-\infty}^{\infty} \frac{\det(V)^{-1/2}}{(2\pi)^{n/2}}\, \sigma_{LS}^{-n} \exp\left( -\frac{(x-\mu 1)^T V^{-1} (x-\mu 1)}{2\sigma_{LS}^2} \right) \frac{1}{\sigma_{LS}}\, d\mu\, d\sigma_{LS} . \]

Substituting ζ = σ_LS^2, so that dσ_LS = dζ/(2√ζ), with ζ = 0 when σ_LS = 0 and ζ = ∞ when σ_LS = ∞, gives

\[ m(x|M_{LS}) = \frac{\det(V)^{-1/2}}{2(2\pi)^{n/2}} \int_{-\infty}^{\infty} \left( \int_0^\infty \zeta^{-(\frac{n}{2}+1)} \exp\left( -\frac{(x-\mu 1)^T V^{-1} (x-\mu 1)}{2\,\zeta} \right) d\zeta \right) d\mu . \]

The inner integral with respect to ζ has an integrand that is the kernel of the density of the inv-gamma(n/2, (x−µ1)^T V^{-1}(x−µ1)/2) distribution. Hence,

\[ m(x|M_{LS}) = \frac{\det(V)^{-1/2}}{2(2\pi)^{n/2}} \int_{-\infty}^{\infty} \frac{\Gamma\!\left(\frac{n}{2}\right)}{\left( \frac{(x-\mu 1)^T V^{-1} (x-\mu 1)}{2} \right)^{n/2}}\, d\mu = \frac{\det(V)^{-1/2}\, \Gamma\!\left(\frac{n}{2}\right)}{2\, \pi^{n/2}} \int_{-\infty}^{\infty} \left[ (x - \mu 1)^T V^{-1} (x - \mu 1) \right]^{-n/2} d\mu . \]
By lemma 2 the proof is finished.
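The closed-form marginal of theorem 1 is straightforward to evaluate numerically. The sketch below is a Python illustration (the helper name is ours; NumPy and SciPy assumed), working on the log scale for numerical stability:

```python
import numpy as np
from scipy.special import gammaln

def marginal_lsm(x, V):
    """Closed-form marginal m(x | M_LS) of theorem 1:
    Gamma((n-1)/2) (x'Qx)^{-(n-1)/2} / (det(V)^{1/2} 2 pi^{(n-1)/2} sqrt(1'V^{-1}1))."""
    n = len(x)
    ones = np.ones(n)
    Vinv = np.linalg.inv(V)
    s = ones @ Vinv @ ones                              # 1' V^{-1} 1
    Q = Vinv - np.outer(Vinv @ ones, ones @ Vinv) / s   # Q of lemma 1
    xQx = x @ Q @ x
    log_m = (gammaln((n - 1) / 2)
             - 0.5 * (n - 1) * np.log(xQx)
             - 0.5 * np.linalg.slogdet(V)[1]
             - np.log(2.0)
             - 0.5 * (n - 1) * np.log(np.pi)
             - 0.5 * np.log(s))
    return np.exp(log_m)
```

For $n = 2$ the expression reduces to the minimal-training-sample marginal of theorem 2, which gives a convenient cross-check of an implementation.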
Theorem 2. Let $\mathbf{V}$ be positive definite. Under the location-scale model and reference prior $\pi^N(\mu, \sigma_{LS}) = 1/\sigma_{LS}$, the size of a minimal training sample $\mathbf{x}_\ell$ is given by $m = 2$, i.e., $\mathbf{x}_\ell = \{x_i, x_j\}$, where $x_i, x_j \in \mathbf{x}$ and $i \neq j$. Moreover, it holds that
$$m(\mathbf{x}_\ell|M_{LS}) = \frac{1}{2\sqrt{\det(\mathbf{V}_\ell)\,(\mathbf{x}_\ell^T\mathbf{Q}_\ell\mathbf{x}_\ell)\,(\mathbf{1}_2^T\mathbf{V}_\ell^{-1}\mathbf{1}_2)}}\,,$$
where $\mathbf{V}_\ell$ is obtained from $\mathbf{V}$ by taking its elements lying on the intersections of the $\ell$ rows and $\ell$ columns, and
$$\mathbf{Q}_\ell = \mathbf{V}_\ell^{-1} - \frac{\mathbf{V}_\ell^{-1}\mathbf{1}_2\mathbf{1}_2^T\mathbf{V}_\ell^{-1}}{\mathbf{1}_2^T\mathbf{V}_\ell^{-1}\mathbf{1}_2} \quad \text{with } \mathbf{1}_2 = (1,1)^T.$$
Proof. We must show that if $m = 1$ then $m(\mathbf{x}_\ell|M_{LS}) = \infty$, while if $m = 2$ then $0 < m(\mathbf{x}_\ell|M_{LS}) < \infty$. Suppose that $m = 1$; then we use only one observation $x_i \in \mathbf{x}$ as a training sample. It holds that
$$m(x_i|M_{LS}) = \int_0^\infty\int_{-\infty}^\infty f(x_i|\mu,\sigma_{LS},M_{LS})\,\pi^N(\mu,\sigma_{LS})\,d\mu\,d\sigma_{LS} = \int_0^\infty\int_{-\infty}^\infty \frac{u_i^{-1}}{(2\pi)^{1/2}\sigma_{LS}}\exp\left(-\frac{(x_i-\mu)^2/u_i^2}{2\sigma_{LS}^2}\right)\frac{1}{\sigma_{LS}}\,d\mu\,d\sigma_{LS} = \frac{u_i^{-1}}{(2\pi)^{1/2}}\int_0^\infty \sigma_{LS}^{-2}\left(\int_{-\infty}^\infty \exp\left(-\frac{(x_i-\mu)^2}{2u_i^2\sigma_{LS}^2}\right)d\mu\right)d\sigma_{LS}\,.$$
The inner integral with respect to $\mu$ has an integrand that is the kernel of the $N(x_i, u_i^2\sigma_{LS}^2)$ probability density function. Hence,
$$m(x_i|M_{LS}) = \frac{u_i^{-1}}{(2\pi)^{1/2}}\int_0^\infty \sigma_{LS}^{-2}\,(2\pi)^{1/2}u_i\sigma_{LS}\,d\sigma_{LS} = \int_0^\infty \frac{1}{\sigma_{LS}}\,d\sigma_{LS} = \lim_{M\to\infty}\lim_{N\to 0}\Big[\ln \sigma_{LS}\Big]_N^M = \infty - (-\infty) = \infty\,.$$
Since $m(x_i|M_{LS})$ diverges, one observation cannot be used as a minimal training sample. Suppose instead that $m = 2$; then theorem 1 shows that $m(x_i, x_j|M_{LS})$ converges and equals
$$m(\mathbf{x}_\ell|M_{LS}) = \frac{1}{2\sqrt{\det(\mathbf{V}_\ell)\,(\mathbf{x}_\ell^T\mathbf{Q}_\ell\mathbf{x}_\ell)\,(\mathbf{1}_2^T\mathbf{V}_\ell^{-1}\mathbf{1}_2)}}\,.$$
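Theorem 2 can be coded directly. A small Python sketch (function name is ours; NumPy assumed) evaluates $m(\mathbf{x}_\ell|M_{LS})$ for the minimal training sample $\{x_i, x_j\}$:

```python
import numpy as np
from itertools import combinations

def marginal_lsm_pair(x, V, i, j):
    """m(x_l | M_LS) for the minimal training sample {x_i, x_j} (theorem 2):
    1 / (2 sqrt(det(V_l) (x_l' Q_l x_l) (1' V_l^{-1} 1)))."""
    idx = [i, j]
    Vl = V[np.ix_(idx, idx)]      # 2x2 submatrix of V
    xl = x[idx]
    ones2 = np.ones(2)
    Vinv = np.linalg.inv(Vl)
    s = ones2 @ Vinv @ ones2
    Ql = Vinv - np.outer(Vinv @ ones2, ones2 @ Vinv) / s
    return 1.0 / (2.0 * np.sqrt(np.linalg.det(Vl) * (xl @ Ql @ xl) * s))

# all n(n-1)/2 minimal training samples:
# values = [marginal_lsm_pair(x, V, i, j) for i, j in combinations(range(len(x)), 2)]
```

For $\mathbf{V} = \mathbf{I}$ the formula collapses to $1/(2|x_i - x_j|)$, a handy sanity check.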
3.2 The Random Effects Model
Definition 5. Let $\mathbf{V}$ be a known positive definite matrix, $\mu$ an unknown location parameter and $\sigma_{RE}$ an unknown scale parameter. Then the random effects model $M_{RE}$ of the data $\mathbf{X}$ is described by (j), (jj) and (jjj).

(j) $\mathbf{X} = \mu\mathbf{1} + \boldsymbol{\lambda}_{RE} + \boldsymbol{\varepsilon}_{RE}$.

(jj) $\boldsymbol{\varepsilon}_{RE} \sim N(\mathbf{0}, \mathbf{V})$.

(jjj) $\boldsymbol{\lambda}_{RE} \sim N(\mathbf{0}, \sigma_{RE}^2\mathbf{I})$.
The components (j), (jj) and (jjj) of the random effects model can be combined to form the multivariate normal distribution $\mathbf{X} \sim N(\mu\mathbf{1}, \mathbf{V}+\sigma_{RE}^2\mathbf{I})$. Within the context of application the quoted uncertainties in $\mathbf{V}$ are understated because of the presence of heterogeneity. In the random effects model an adjustment for this heterogeneity is made to $\mathbf{V}$ by adding a common term $\sigma_{RE}^2$ to every element in the diagonal of $\mathbf{V}$, where $\sigma_{RE}^2$ represents the heterogeneity. Since only the diagonal is adjusted by $\sigma_{RE}^2$, only the variances $u_i^2$ are corrected for the heterogeneity, while the covariances in $\mathbf{V}$ are left unaltered. By using the notation $\mathbf{X}|\mu, \sigma_{RE}, M_{RE}$ we make it explicit that the data are conditioned on the two parameters $\mu$ and $\sigma_{RE}$ under the random effects model.
Definition 6. Under the random effects model the data $\mathbf{X}$ follow the multivariate normal distribution $N(\mu\mathbf{1}, \mathbf{V}+\sigma_{RE}^2\mathbf{I})$ with probability density function
$$f(\mathbf{x}|\mu,\sigma_{RE},M_{RE}) = \frac{\det(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1/2}}{(2\pi)^{n/2}}\exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu\mathbf{1})^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}(\mathbf{x}-\mu\mathbf{1})\right).$$
The Berger & Bernardo reference prior has been derived in [Bodnar, Link & Elster, 2016] and is given by
$$\pi^N(\mu, \sigma_{RE}) \propto \sqrt{\sigma_{RE}^2 \cdot \mathrm{tr}\!\left((\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-2}\right)}\,.$$
Theorem 3. Let $n > 1$ and $\mathbf{V}$ be positive definite. Then under the random effects model and reference prior
$$\pi^N(\mu, \sigma_{RE}) = \pi^N(\sigma_{RE}) = \sqrt{\sigma_{RE}^2 \cdot \mathrm{tr}\!\left((\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-2}\right)},$$
the marginal distribution of $\mathbf{X}$ is given by
$$m(\mathbf{x}|M_{RE}) = \int_0^\infty \frac{(2\pi)^{-\frac{n-1}{2}}\det(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1/2}}{\sqrt{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}}\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\sigma_{RE}^2)\mathbf{x}\right)\pi^N(\sigma_{RE})\,d\sigma_{RE}\,,$$
where
$$\mathbf{Q}(\sigma_{RE}^2) = (\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1} - \frac{(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}}{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}\,.$$
Proof. The marginal distribution of the data $\mathbf{X}$ is obtained by integrating out the parameters $\mu$ and $\sigma_{RE}$ in the joint probability density function $f(\mathbf{x}, \mu, \sigma_{RE}|M_{RE})$:
$$m(\mathbf{x}|M_{RE}) = \int_0^\infty\int_{-\infty}^\infty f(\mathbf{x}|\mu,\sigma_{RE},M_{RE})\,\pi^N(\sigma_{RE})\,d\mu\,d\sigma_{RE} = \int_0^\infty\int_{-\infty}^\infty \frac{\det(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1/2}}{(2\pi)^{n/2}}\exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu\mathbf{1})^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}(\mathbf{x}-\mu\mathbf{1})\right)\pi^N(\sigma_{RE})\,d\mu\,d\sigma_{RE}\,.$$
From lemma 1 with $\mathbf{V}+\sigma_{RE}^2\mathbf{I}$ substituted for $\mathbf{V}$, we can rewrite the exponential as
$$\exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu\mathbf{1})^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}(\mathbf{x}-\mu\mathbf{1})\right) = \exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\sigma_{RE}^2)\mathbf{x}\right)\exp\left(-\tfrac{1}{2}\,\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}\left(\mu - \frac{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{x}}{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}\right)^2\right).$$
The second factor is proportional to the kernel of the $N(\omega, \tau^2)$ probability density function with location parameter $\omega = \mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{x}\,\big/\,\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}$ and variance $\tau^2 = \left(\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}\right)^{-1}$. Integrating the exponential with respect to $\mu$ over its range yields
$$\int_{-\infty}^\infty \exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu\mathbf{1})^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}(\mathbf{x}-\mu\mathbf{1})\right)d\mu = \frac{\sqrt{2\pi}\,\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\sigma_{RE}^2)\mathbf{x}\right)}{\sqrt{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}}\,.$$
The proof is complete.
In appendix B we show that the marginal distribution $m(\mathbf{x})$ can only be proper if the prior distribution $\pi^N(\theta)$ is proper. Hence, if we can show that $\pi^N(\mu, \sigma_{RE}|\mathbf{x}_\ell)$ is proper for some training sample $\mathbf{x}_\ell$ of size $m > 1$, then by appendix B we know that $m(\mathbf{x}_\ell|M_{RE})$ is proper and the minimal training sample must be of size $m = 2$. It has been proven in [Bodnar, Link & Elster, 2016, p.32] that for the random effects model and the Berger and Bernardo reference prior, if $n > 1$ then $\pi^N(\mu, \sigma_{RE}|\mathbf{x})$ is proper. We can therefore state a theorem about the minimal training sample without proof, since it has already been implicitly proven in the cited article.
Theorem 4. Let $\mathbf{V}$ be positive definite. Under the random effects model and reference prior
$$\pi^N(\mu, \sigma_{RE}) = \pi^N(\sigma_{RE}) = \sqrt{\sigma_{RE}^2 \cdot \mathrm{tr}\!\left((\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-2}\right)},$$
the size of a minimal training sample $\mathbf{x}_\ell$ is given by $m = 2$, i.e., $\mathbf{x}_\ell = \{x_i, x_j\}$, where $x_i, x_j \in \mathbf{x}$ and $i \neq j$. Moreover, it holds that
$$m(\mathbf{x}_\ell|M_{RE}) = \int_0^\infty \frac{(2\pi)^{-\frac{1}{2}}\det(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1/2}}{\sqrt{\mathbf{1}_2^T(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}_2}}\exp\left(-\tfrac{1}{2}\mathbf{x}_\ell^T\mathbf{Q}_\ell(\sigma_{RE}^2)\mathbf{x}_\ell\right)\pi^N(\sigma_{RE})\,d\sigma_{RE}\,,$$
where $\mathbf{V}_\ell$ is obtained from $\mathbf{V}$ by taking its elements lying on the intersections of the $\ell$ rows and $\ell$ columns, and
$$\mathbf{Q}_\ell(\sigma_{RE}^2) = (\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1} - \frac{(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}_2\mathbf{1}_2^T(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1}}{\mathbf{1}_2^T(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}_2} \quad \text{with } \mathbf{1}_2 = (1,1)^T.$$
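The one-dimensional integral of theorem 4 is easy to evaluate with standard quadrature. A Python sketch (our own helper; `scipy.integrate.quad` stands in here for the Simpson rule used in appendix D):

```python
import numpy as np
from scipy.integrate import quad

def marginal_rem_pair(x, V, i, j):
    """m(x_l | M_RE) of theorem 4: one-dimensional integral over sigma_RE
    with the reference prior sqrt(sigma^2 * tr((V_l + sigma^2 I)^{-2}))."""
    idx = [i, j]
    Vl = V[np.ix_(idx, idx)]
    xl = x[idx]
    ones2 = np.ones(2)
    I2 = np.eye(2)

    def integrand(sigma):
        A = Vl + sigma**2 * I2
        W = np.linalg.inv(A)
        s = ones2 @ W @ ones2
        Q = W - np.outer(W @ ones2, ones2 @ W) / s
        prior = np.sqrt(sigma**2 * np.trace(W @ W))
        return ((2.0 * np.pi) ** -0.5 * np.linalg.det(A) ** -0.5
                / np.sqrt(s) * np.exp(-0.5 * xl @ Q @ xl) * prior)

    val, _ = quad(integrand, 0.0, np.inf)
    return val
```

The same structure, with $(2\pi)^{-(n-1)/2}$ and the full matrix $\mathbf{V}$, gives $m(\mathbf{x}|M_{RE})$ of theorem 3.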
3.3 Location-Scale Model Vs. Random Effects Model
Both the location-scale model and the random effects model imply that the data $\mathbf{X}$ follow a multivariate normal distribution, that is $\mathbf{X} \sim N(\mu\mathbf{1}, \sigma_{LS}^2\mathbf{V})$ and $\mathbf{X} \sim N(\mu\mathbf{1}, \mathbf{V}+\sigma_{RE}^2\mathbf{I})$, respectively. The difference between them is how they account for the heterogeneity. The choice stands between modeling the heterogeneity as a multiplicative correction factor (location-scale model) or as an additive correction term (random effects model).
We make a selection between the location-scale model and the random effects model according to which model is favored by the IBF. The IBF for the two models is constructed in the following way:
$$B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell) = B^N_{M_{RE}M_{LS}}(\mathbf{x}) \times B^N_{M_{LS}M_{RE}}(\mathbf{x}_\ell) = \frac{m(\mathbf{x}|M_{RE})}{m(\mathbf{x}|M_{LS})} \times \frac{m(\mathbf{x}_\ell|M_{LS})}{m(\mathbf{x}_\ell|M_{RE})}\,.$$
We have derived the marginal distributions of the location-scale model in closed form in theorem 1 and theorem 2. However, the marginal distributions of the random effects model are only partly derived in theorem 3 and theorem 4 and need further evaluation by numerical calculation. This numerical calculation is only a minor issue, since we can calculate $m(\mathbf{x}|M_{RE})$ and $m(\mathbf{x}_\ell|M_{RE})$ relatively fast and with good accuracy using Simpson's rule, as explained in appendix D.
Both the location-scale model and the random effects model have been shown to have a minimal training sample size of $m = 2$. With a total sample size of $n$, there are $n(n-1)/2$ ways to select two observations out of $n$. Let $X_\ell = (\mathbf{x}_\ell^1, \mathbf{x}_\ell^2, \ldots, \mathbf{x}_\ell^{l})$ be the set of training samples; there are $l = n(n-1)/2$ elements in $X_\ell$. We will compute the IBF for every such possible selection and summarize the findings in an average IBF and a median IBF. The reason is that the mean and median of the IBF over $X_\ell$ have increased stability in comparison with the IBF of an arbitrary $\mathbf{x}_\ell^i$. If the sample size $n$ is either too small (too slight a stability improvement) or too large (too long a computing time), it is possible to use the expected mean discussed in [Berger & Pericchi, 1996, p.113-114], although we do not use it here. Let the average IBF be defined by
$$\mathrm{Mean}\,B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell) = B^N_{M_{RE}M_{LS}}(\mathbf{x}) \times l^{-1}\sum_{i=1}^{l} B^N_{M_{LS}M_{RE}}(\mathbf{x}_\ell^i) = \frac{m(\mathbf{x}|M_{RE})}{m(\mathbf{x}|M_{LS})} \times l^{-1}\sum_{i=1}^{l}\frac{m(\mathbf{x}_\ell^i|M_{LS})}{m(\mathbf{x}_\ell^i|M_{RE})}\,,$$
while the median IBF is given by
$$\mathrm{Median}\left(B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell)\right) = B^N_{M_{RE}M_{LS}}(\mathbf{x}) \times \mathrm{Median}\left(B^N_{M_{LS}M_{RE}}(\mathbf{x}_\ell^i)\right) = \frac{m(\mathbf{x}|M_{RE})}{m(\mathbf{x}|M_{LS})} \times \mathrm{Median}\left(\frac{m(\mathbf{x}_\ell^i|M_{LS})}{m(\mathbf{x}_\ell^i|M_{RE})}\right).$$
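Once the marginal values are available, the averaging step is mechanical. A minimal Python sketch (function name and inputs are ours; the marginals are assumed to have been computed beforehand):

```python
import numpy as np

def intrinsic_bayes_factors(m_full_re, m_full_ls, m_pair_ls, m_pair_re):
    """Mean and median IBF of M_RE versus M_LS.

    m_full_re, m_full_ls : marginals m(x|M_RE) and m(x|M_LS) of the full sample
    m_pair_ls, m_pair_re : sequences of marginals m(x_l^i|M_LS) and m(x_l^i|M_RE),
                           one entry per minimal training sample (l = n(n-1)/2)."""
    bn = m_full_re / m_full_ls                              # B^N_{M_RE M_LS}(x)
    corrections = np.asarray(m_pair_ls) / np.asarray(m_pair_re)
    return bn * corrections.mean(), bn * np.median(corrections)
```

The median variant simply replaces the arithmetic mean of the training-sample correction factors by their median, which is what makes it robust to outlying pairs.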
4 Numerical illustration
In this section we test the performance of the IBF with simulated samples from the location-scale model and the random effects model, respectively. We draw a sample from the location-scale model $\mathbf{X} = \mu\mathbf{1} + \boldsymbol{\varepsilon}_{LS}$, where $\boldsymbol{\varepsilon}_{LS} \sim N(\mathbf{0}, \sigma_{LS}^2\mathbf{V})$, and then we compute the IBF with this sample. By doing so we test whether the IBF is consistent, i.e., whether the IBF correctly identifies the true model. Then we draw a sample from the random effects model $\mathbf{X} = \mu\mathbf{1} + \boldsymbol{\lambda}_{RE} + \boldsymbol{\varepsilon}_{RE}$, where $\boldsymbol{\varepsilon}_{RE} \sim N(\mathbf{0}, \mathbf{V})$ and $\boldsymbol{\lambda}_{RE} \sim N(\mathbf{0}, \sigma_{RE}^2\mathbf{I})$, and test the IBF the same way (see appendix E for the details of the simulation procedure).
For the simulation we let $\mathbf{V}$ be determined by the data for the gravitational constant (see appendix G). We assume $\mu = 0$ in both models without loss of generality and set $n = 16$. We let $\sigma_{LS} = 1.3$ and thus $\sigma_{LS}^2 = 1.69$, which means that for the location-scale model $\mathbf{X} \sim N(\mu\mathbf{1}, \sigma_{LS}^2\mathbf{V})$. The understated uncertainties in $\mathbf{V}$ are then enlarged by a factor of 1.69 to adjust for the heterogeneity. We let $\sigma_{RE} = 0.0004$ and thus $\sigma_{RE}^2 = 1.6 \times 10^{-7}$, so that the uncertainties in $\mathbf{V}$ are enlarged by adding $1.6 \times 10^{-7}$ to each variance in the diagonal of $\mathbf{V}$. Thus $\sigma_{LS} = 1.3$ and $\sigma_{RE} = 0.0004$ cover roughly the same range of variation in $\mathbf{X}$.

L. S. Model    Mean $B^I_{M_{RE}M_{LS}}$    Median $B^I_{M_{RE}M_{LS}}$
No.1           0.9178607                    0.1063782
No.2           1.936334                     0.1658802
No.3           0.04904022                   0.01851217
No.4           3.100906                     0.6824315
No.5           0.003090198                  0.001818678
Table 1: The IBF for five different simulated samples from the location-scale model. We put µ = 0, σLS = 1.3 and V contains reported uncertainties for
the Newtonian constant of gravitation G.
R. E. Model    Mean $B^I_{M_{RE}M_{LS}}$    Median $B^I_{M_{RE}M_{LS}}$
No.1           85.16479                     8.664935
No.2           356.2071                     186.6131
No.3           355.8205                     128.8503
No.4           66.82962                     34.00741
No.5           24249.18                     9591.932
Table 2: The IBF for five different simulated samples from the random effects model. We put µ = 0, σRE = 0.0004 and V contains reported uncertainties
for the Newtonian constant of gravitation G.
We expect the IBFs in table 1 to be less than one, since this corresponds to support for the location-scale model when it is true. We see in table 1 that the median IBF is much more accurate at identifying the location-scale model. The reason is the presence of outliers, which cause the mean IBF to be much larger, sometimes even larger than one. Out of 120 unique minimal training samples only a few very large outliers were present, while in most other cases the IBF was far below one. In table 2 we expect the IBFs to be larger than one, since this corresponds to support for the random effects model when it is true. All five trials gave consistent results; they are, however, somewhat unstable, as in table 1.
5 Empirical applications

5.1 Newtonian constant of gravitation
In the year 1687 Newton published his famous Principia, in which he presented his thesis on the law of general gravitation. Newton proposed that the force that causes an apple to fall from a tree is the same kind of force that keeps the moon in orbit around the earth: the force of gravitational attraction. In most physics textbooks this law is written as the equation $F = G\frac{m_1 m_2}{r^2}$. The constant of proportionality, G, is called the general constant of gravitation and is a natural constant. That means that this constant is the same throughout the whole universe. It determines the strength of the gravitational force, given mass and distance. Imagine two objects placed in space far away from any stars and planets. If each object has a mass of 1 kg and they are put at a distance of 1 m from each other, then the objects exert a gravitational force on each other of exactly $F = G \approx 0.000\,000\,000\,066\,7\ \mathrm{N} = 6.67 \cdot 10^{-11}\ \mathrm{N}$. Compare this to the weight of a bottle of water of 1 kg, which is 9.8 N. The weight of a feather, 1 gram, corresponds to 0.0098 N.
Figure 1: The gravitational constant G. The red diamonds are the estimates and the blue intervals are the uncertainties; data from [Merkatas et al., 2019].
The average and median IBFs for the data on the gravitational constant are given by
$$\mathrm{Mean}\left(B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell)\right) = 0.8673046 \quad \text{and} \quad \mathrm{Median}\left(B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell)\right) = 0.4157501\,.$$

Figure 2: The IBF for each of the 120 minimal training samples. The y-axis displays $\log\left(B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell)\right)$; negative values correspond to an IBF less than one, while positive values correspond to an IBF greater than one.
Both the average IBF and the median IBF clearly indicate support for the location-scale model in comparison to the random effects model. As seen in figure 2, except for a few outliers the data consistently indicate that the IBF is less than one and, hence, support the location-scale model.
6 Discussion
In view of the gravitational constant G data set, the IBF clearly favors the location-scale model over the random effects model. The mean IBF was only slightly under one and seems to be strongly influenced by outliers, since most individual IBFs were far under one. It is typical for averages to be sensitive to outliers. Since the spectrum of IBFs in figure 2 shows only a few outliers, we think that the median IBF represents the overall IBF better than the average IBF.
The IBF is the ratio of the marginal distributions of the respective models, and the marginal distribution is a measure of the fit of a model to the data. A model that fits the data well does not necessarily predict future outcomes well. When a model is overfitted to the data, the parameter uncertainties are large. It is our assumption that the compared models are not overfitted, and that therefore the model with the highest marginal distribution predicts future outcomes best. In the introduction we discussed the robustness of each model to model misspecification, and it was found that if the location-scale model was the wrong model, then it produced a poor estimate of $\mu$. If on the other hand the random effects model was the wrong model, it still persisted in making a good estimate of $\mu$ (95% coverage probability). This finding supports the random effects model, not the location-scale model that is favored by the IBF.
In reality, however, probably neither the location-scale model nor the random effects model is the actual 'true model'. Our results only show that the location-scale model is favored compared to the random effects model, but what about other models? Both the location-scale model and the random effects model are nevertheless good models for data inconsistency or heterogeneous data.
7 Appendices

7.1 A: Frequentist statistics Vs. Bayesian statistics
In frequentist statistics we specify a probability density function $f(\mathbf{x}|\theta)$ of the data $\mathbf{X}$ given the parameter $\theta$. The data $\mathbf{x}$ are known while the parameter $\theta$ is unknown and considered to be deterministic. Typically we want to estimate $\theta$ by some estimator $\hat\theta$ which is a function of the data, $\hat\theta = \hat\theta(\mathbf{x})$. The estimator $\hat\theta$ is a random variable which has a sampling distribution, an expectation $E(\hat\theta)$ and a variance $V(\hat\theta)$. It is important to note that in frequentist statistics probabilities are attached to the estimator $\hat\theta(\mathbf{x})$, not the parameter $\theta$. The frequentist confidence interval is said to cover the true value of $\theta$ with probability $100(1-\alpha)\%$, where $\alpha$ is the significance level. This $100(1-\alpha)\%$ probability has a repeatability interpretation: $100(1-\alpha)\%$ of the times that we collect data under identical conditions and compute the confidence interval, it will cover the true value of $\theta$. The confidence interval is an interval estimator, for example $\hat\theta \pm \lambda_\alpha\sqrt{\hat V(\hat\theta)}$, meaning that its boundaries are random. Confidence intervals are easily misunderstood. A typical mistake is to think that it is $\theta$ that lies inside the confidence interval with probability $100(1-\alpha)\%$, but this is false, since $\theta$ does not alter between different values; it is constant. The frequentist principle is, in quote: "In repeated practical use of a statistical procedure, the long run average actual accuracy should not be less than (and ideally should equal) the long run average reported accuracy" [Bayarri & Berger, 2004, p.60].
In Bayesian statistics we specify a probability density function $f(\mathbf{x}|\theta)$ of the data $\mathbf{x}$ and a prior probability density function $\pi(\theta)$ for the parameter $\theta$. The full probability distribution is then given by the joint distribution $f(\mathbf{x}, \theta) = f(\mathbf{x}|\theta)\pi(\theta)$. The data $\mathbf{x}$ are known and the parameter $\theta$ is unknown. The difference is that in Bayesian statistics the parameter $\theta$ is no longer considered a fixed constant as it is in frequentist statistics. In Bayesian statistics the parameter $\theta$ is a random variable with a probability density function $\pi(\theta)$. By using Bayes' theorem one can obtain the probability distribution of the parameter $\theta$ conditioned on the data $\mathbf{x}$, called the posterior distribution $p(\theta|\mathbf{x})$:
$$p(\theta|\mathbf{x}) = \frac{f(\mathbf{x},\theta)}{m(\mathbf{x})} = \frac{f(\mathbf{x}|\theta)\pi(\theta)}{\int f(\mathbf{x}|\theta)\pi(\theta)\,d\theta} \propto f(\mathbf{x}|\theta)\pi(\theta)\,.$$
The prior distribution $\pi(\theta)$ is specified before the sample $\mathbf{x}$ is collected, and it corresponds to the statistician's prior beliefs about which values are probable for $\theta$. The statistician's prior beliefs are not founded on observed data and are therefore subjective and likely to be biased. If the statistician wants to pick a prior distribution $\pi(\theta)$ objectively, he may choose a noninformative prior distribution. The principle behind noninformative prior distributions is that they should only make 'vague' probability statements about $\theta$. We can interpret this as saying that no value of $\theta$ should be favored more or less than any other value of $\theta$. The uniform distribution of $\theta$ satisfies this condition. It is however common that you want to transform your parameters by some function, and if you do, the uniformity is lost under the transformation. The so-called Jeffreys prior distribution is a solution to this problem. The Jeffreys prior is invariant to transformations of the parameter $\theta$ and is in this sense 'neutral'. Another kind of noninformative prior distribution is called the reference prior. The reference prior is determined so that the information gained about $\theta$ in observing one observation $\mathbf{x}$ is maximized. This corresponds to a minimally informative prior distribution for $\theta$, because if the prior distribution is very informative about $\theta$, then we learn little about $\theta$ in observing $\mathbf{x}$, since much information was already contained in the prior distribution. The Kullback-Leibler divergence and the maximization of the Shannon information are involved in the definition of the reference prior [Berger, Bernardo & Sun, 2009]. There is a strong relationship between the Shannon information and the concept of entropy, and an interesting perspective on the reference prior is that it can be thought of as having achieved the highest entropy (fragmentation and disorder) of all the prior distributions.
In Bayesian statistics we estimate $\theta$ by the estimator $\hat\theta = E_{p(\theta|\mathbf{x})}(\theta)$. Alternatively we may use the median or the mode of the posterior distribution $p(\theta|\mathbf{x})$ to estimate $\theta$. This means that in Bayesian statistics we use the posterior distribution to estimate $\theta$ and we consider $\theta$ to be a random variable. In frequentist statistics we would instead have used the sampling distribution of the estimator $\hat\theta(\mathbf{x})$ to estimate $\theta$, where $\theta$ is considered to be a nonrandom constant.

In Bayesian statistics we use the credible interval as an interval estimator of $\theta$, not confidence intervals. The credible interval is an interval such that $P(\theta_L < \theta < \theta_U|\mathbf{x}) = 1 - \alpha$; hence the credible interval is not uniquely defined. A common choice is to simply let the credible interval be described by $E_{p(\theta|\mathbf{x})}(\theta) \pm \lambda_\alpha\sqrt{V_{p(\theta|\mathbf{x})}(\theta)}$, in resemblance to the frequentist confidence interval. The credible interval has fixed boundaries; it is $\theta$ that is random and lies inside the credible interval with probability $100(1-\alpha)\%$. Interestingly, this is the interpretation that is commonly given to confidence intervals, which is entirely erroneous. However, the fact that it is common to mistake confidence intervals for credible intervals speaks in favor of credible intervals being more 'natural to the mind' than confidence intervals. Ultimately, frequentist statistics has historically been favored over Bayesian statistics because the former relies on fewer assumptions and is computationally easier than the latter. But with the advancement of computer devices and the introduction of objective Bayesian approaches, Bayesian statistics has become a school of thought to be reckoned with.
7.2 B: The necessity for a proper prior distribution
A prior distribution is said to be proper when $\int \pi(\theta_i)\,d\theta_i = 1$ (or, more generally, when $\int \pi(\theta_i)\,d\theta_i < \infty$). To see why prior distributions must be proper when using Bayes factors, consider the following:
$$\int_{\mathbf{x}\in\Omega} m_i(\mathbf{x})\,d\mathbf{x} = \int_{\mathbf{x}\in\Omega}\int_{\theta_i\in\Theta_i} f_i(\mathbf{x}|\theta_i)\,\pi_i(\theta_i)\,d\theta_i\,d\mathbf{x} = 1\,.$$
Since $f_i(\mathbf{x}|\theta_i)$ is a probability density function, it integrates to unity with respect to $\mathbf{x}$. Changing the order of integration,
$$\int_{\theta_i\in\Theta_i}\int_{\mathbf{x}\in\Omega} f_i(\mathbf{x}|\theta_i)\,\pi_i(\theta_i)\,d\mathbf{x}\,d\theta_i = \int_{\theta_i\in\Theta_i}\pi_i(\theta_i)\left(\int_{\mathbf{x}\in\Omega} f_i(\mathbf{x}|\theta_i)\,d\mathbf{x}\right)d\theta_i = \int_{\theta_i\in\Theta_i}\pi_i(\theta_i)\,d\theta_i = 1\,.$$
Hence the prior distribution must be proper if the marginal distribution is to integrate to unity.
7.3 C: Training samples in the IBF
In section 2.2 we discussed IBFs and minimal training samples. Here follows some discussion and examples to clarify the concepts presented there. Consider the following two definite integrals a) and b):
$$\text{a)}\quad \lim_{L\to\infty}\int_1^L \frac{1}{x}\,dx = \lim_{L\to\infty}\Big[\ln x\Big]_1^L = \infty\,, \qquad \text{b)}\quad \lim_{L\to\infty}\int_1^L \frac{1}{x^2}\,dx = \lim_{L\to\infty}\Big[-\frac{1}{x}\Big]_1^L = 1\,.$$
Why does the definite integral a), whose integrand is $1/x$, diverge, while b), whose integrand is $1/x^2$, converges? The reason is that the integrand of a) does not go to zero fast enough with increasing values of $x$. Viewing the integral as a sum, the cumulative contributions of $1/x$ for ever increasing $L$ are too large for the integral to be bounded above, and hence the integral diverges. The integrand $1/x^2$ in b), on the other hand, does go to zero fast enough with increasing values of $x$, and hence the integral converges to unity.
With this in mind, consider a case where $x_1,\ldots,x_n \overset{iid}{\sim} N(0,\sigma^2)$. Then the noninformative reference prior of $\sigma$ is $\pi^N(\sigma) \propto 1/\sigma$, where $\sigma \in (0,\infty)$. Let us check if the reference prior integrates to unity:
$$\int_0^\infty \pi^N(\sigma)\,d\sigma = \int_0^\infty \frac{1}{\sigma}\,d\sigma = \lim_{M\to\infty}\lim_{N\to 0}\Big[\ln\sigma\Big]_N^M = \infty - (-\infty) = \infty\,.$$
Now, let us use one observation $x_i$ from the data set as a training sample and see how the posterior prior distribution $\pi^N(\sigma|x_i)$ differs from $\pi^N(\sigma)$:
$$\pi^N(\sigma|x_i) \propto f(x_i|\sigma)\,\pi^N(\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{x_i^2}{2\sigma^2}\right)\frac{1}{\sigma} \propto \sigma^{-(1+1)}\exp\left(-\frac{x_i^2/2}{\sigma^2}\right).$$
After a suitable change of variables the final expression can be seen to be the kernel of the inv-gamma$(1/2,\,x_i^2/2)$ density. Substituting $\sigma^2 = \zeta$, so that $d\sigma = \frac{1}{2\sqrt{\zeta}}\,d\zeta$ with limits $\sigma = 0 \Leftrightarrow \zeta = 0$ and $\sigma = \infty \Leftrightarrow \zeta = \infty$, we obtain
$$\int_0^\infty \pi^N(\sigma|x_i)\,d\sigma = \frac{1}{\sqrt{2\pi}}\int_0^\infty \sigma^{-(1+1)}\exp\left(-\frac{x_i^2}{2\sigma^2}\right)d\sigma = \frac{1}{2\sqrt{2\pi}}\int_0^\infty \zeta^{-\left(\frac{1}{2}+1\right)}\exp\left(-\frac{x_i^2/2}{\zeta}\right)d\zeta = \frac{1}{2\sqrt{2\pi}}\,\frac{\Gamma\!\left(\frac{1}{2}\right)}{(x_i^2/2)^{1/2}} = \frac{1}{2x_i} < \infty\,.$$
Hence the noninformative prior distribution $\pi^N(\sigma)$ can be made proper by determining the posterior $\pi^N(\sigma|x_i)$ based on a single observation and the prior $\pi^N(\sigma)$. This posterior is then used as a prior for the other elements in the sample. Moreover, we get $\int_0^\infty c\,\pi^N(\sigma|x_i)\,d\sigma = 1$ with $c = 2x_i$.

The difference between the prior distribution $\pi^N(\sigma)$ and the posterior prior distribution $\pi^N(\sigma|x_i)$ is that the former is flatter than the latter. The prior distribution is flat in the sense that its probability mass is too spread out across the range of $\sigma$. The problem is then the same for $\pi^N(\sigma)$ as for the integrand of the definite integral in a). However, since the posterior prior distribution $\pi^N(\sigma|x_i)$ has enough of its probability mass gathered together, it integrates to a finite value.
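The computation above is easy to confirm numerically. A short Python sketch (our own helper; SciPy assumed) integrates the unnormalized posterior and compares with $1/(2x_i)$:

```python
import numpy as np
from scipy.integrate import quad

def posterior_mass(xi):
    """Integral over sigma in (0, inf) of the unnormalized posterior
    (1/sqrt(2 pi)) sigma^{-2} exp(-xi^2 / (2 sigma^2)); appendix C
    claims this equals 1 / (2 xi)."""
    integrand = lambda s: (2.0 * np.pi) ** -0.5 * s ** -2 * np.exp(-xi**2 / (2.0 * s**2))
    val, _ = quad(integrand, 0.0, np.inf)
    return val
```

Evaluating `posterior_mass(xi)` for a few values of `xi > 0` and comparing with `1 / (2 * xi)` confirms the closed-form result to quadrature accuracy.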
7.4 D: The numerical integration
Theorem 3 provides an analytically derived version of the marginal distribution for the random effects model up to a one-dimensional integral which depends on $\sigma_{RE}$. This integral needs to be treated numerically, which is the topic of this appendix. For simplicity of notation, let the integral from theorem 3 be denoted by $I = m(\mathbf{x}|M_{RE}) = \int_0^\infty f(\sigma_{RE})\,d\sigma_{RE}$ and the corresponding numerical approximation by $I_{\mathrm{Simpson}}$. We use the Simpson rule to compute $I$ numerically, since it attains higher accuracy than the Trapezoidal rule for the same number of nodes. The Simpson rule computes the integral of the function $f(\sigma_{RE})$ defined on the interval $\sigma_{RE} \in (a,b)$. It requires that the interval $(a,b)$ first be partitioned into $n$ subintervals of equal length $l$, $n$ even, such that $b - a = n \cdot l$. Let $\sigma_{RE}^i = a + i \cdot l$ and $f(\sigma_{RE}^i) = f_i$ for $i = 0, 1, \ldots, n$; then $I$ can be numerically approximated by the Simpson rule in the following way:
$$I_{\mathrm{Simpson}} = \frac{l}{3}\left(f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + \cdots + 2f_{n-2} + 4f_{n-1} + f_n\right).$$
The range $\sigma_{RE} \in (0,\infty)$ implies that $a = 0$ and $b = \infty$. We follow the method of [Bodnar, Muhumuza & Possolo, 2020] and make the transformation $\sigma_{RE} = \tan(\omega)$, which is a one-to-one map of the interval $\omega \in (0, \pi/2)$ onto $\sigma_{RE} \in (0,\infty)$. This allows us to compute the integral over the bounded range $\omega \in (0,\pi/2)$ instead of over the unbounded range $\sigma_{RE} \in (0,\infty)$. The Jacobian of this change of variable is $\left|\frac{d}{d\omega}\tan(\omega)\right| = 1/\cos^2(\omega)$. Hence
$$I = \int_0^\infty \frac{(2\pi)^{-\frac{n-1}{2}}\det(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1/2}}{\sqrt{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}}\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\sigma_{RE}^2)\mathbf{x}\right)\pi^N(\sigma_{RE})\,d\sigma_{RE} = \int_0^{\pi/2} \frac{(2\pi)^{-\frac{n-1}{2}}\det(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1/2}}{\sqrt{\mathbf{1}^T(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1}\mathbf{1}}}\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\tan^2(\omega))\mathbf{x}\right)\frac{\pi^N(\tan(\omega))}{\cos^2(\omega)}\,d\omega\,,$$
where
$$\mathbf{Q}(\tan^2(\omega)) = (\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1} - \frac{(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1}\mathbf{1}\mathbf{1}^T(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1}}{\mathbf{1}^T(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1}\mathbf{1}}\,.$$
A numerical calculation of an integral is not exact, and some error, however small, is bound to follow. It is common in mathematics to say that a numerical calculation is only an approximation to the corresponding analytical calculation. An approximation can easily be confused with an estimation and vice versa, but the word 'approximation' must not be taken as synonymous with the word 'estimation', because the two carry different associations. The main difference lies in the nature of the errors of the respective concepts. The error of an approximation specifies the exactness with which an integral has been calculated. Let $\varepsilon$ denote the error of an approximation; then $I = I_{\mathrm{Simpson}} \pm \varepsilon$, which implies that $I \in (I_{\mathrm{Simpson}} - \varepsilon, I_{\mathrm{Simpson}} + \varepsilon)$. In this sense it is said that the transcendental number $\pi$ is approximately 3.14, which means that $\pi$ is known up to two decimals and has an error of magnitude $\varepsilon \sim 10^{-3}$. On the other hand, the error of an estimation specifies a precision, which reflects an unbounded variability. In other words, an estimator can take on any value in its range regardless of its precision, although some values will be increasingly unlikely to be taken by the estimator as the precision increases.
The Simpson rule is implemented in appendix H in the R code for the computation of the IBF. For the gravitational constant we have convergence up to six decimals of the integral $I_{\mathrm{Simpson}}$ at $N = 50\,000$ intervals.
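The scheme of this appendix can be sketched compactly. The following Python implementation (ours, independent of the appendix-H R code) applies the composite Simpson rule after the $\sigma = \tan(\omega)$ substitution; the integrand is assumed to vanish as $\sigma \to \infty$, so the node at $\omega = \pi/2$ is set to zero:

```python
import numpy as np

def simpson_tan(f, n_intervals=50_000):
    """Integrate f over (0, inf): substitute sigma = tan(omega), which maps
    omega in (0, pi/2) onto (0, inf), then apply the composite Simpson rule
    with weights 1, 4, 2, 4, ..., 2, 4, 1 on [0, pi/2]."""
    if n_intervals % 2:            # Simpson needs an even number of intervals
        n_intervals += 1
    omega = np.linspace(0.0, np.pi / 2, n_intervals + 1)
    g = np.empty_like(omega)
    g[:-1] = f(np.tan(omega[:-1])) / np.cos(omega[:-1]) ** 2   # Jacobian 1/cos^2
    g[-1] = 0.0                    # limit at omega = pi/2 for a decaying integrand
    w = np.ones_like(omega)
    w[1:-1:2] = 4.0
    w[2:-1:2] = 2.0
    h = omega[1] - omega[0]
    return h / 3.0 * np.sum(w * g)
```

Integrands with known values, such as $\int_0^\infty e^{-\sigma}\,d\sigma = 1$, provide a quick check that a given implementation of the scheme is correct.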
7.5 E: The simulation
Consider a random variable $\mathbf{Z} = [Z_1, \ldots, Z_d]^T$ that follows a multivariate normal distribution with mean vector $\boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$ and $(d \times d)$-dimensional covariance matrix $\boldsymbol{\Sigma}$, $\mathbf{Z} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Let $\mathbf{b}$ be a constant vector and $\mathbf{D}$ a known square matrix; then the transformation $\mathbf{T} = \mathbf{D}\mathbf{Z} + \mathbf{b}$ follows the multivariate normal distribution with mean vector $\mathbf{D}\boldsymbol{\mu} + \mathbf{b}$ and covariance matrix $\mathbf{D}\boldsymbol{\Sigma}\mathbf{D}^T$, that is $\mathbf{T} \sim N(\mathbf{D}\boldsymbol{\mu} + \mathbf{b}, \mathbf{D}\boldsymbol{\Sigma}\mathbf{D}^T)$ [Rizzo, 2008, p.70-72]. This is the information we need to simulate random samples of $\boldsymbol{\varepsilon}_{LS}$, $\boldsymbol{\varepsilon}_{RE}$ and $\boldsymbol{\lambda}_{RE}$.
Start with the random effects model, given that we know $\mu$ and $\sigma_{RE}$. Since the covariance matrix $\mathbf{V}$ is positive definite, we can use the Cholesky decomposition to write $\mathbf{V} = \mathbf{C}^T\mathbf{C}$. For $\boldsymbol{\varepsilon}_{RE} \sim N(\mathbf{0}, \mathbf{V})$, first generate iid standard normal observations, that is, $\mathbf{Z} = [Z_1, \ldots, Z_d]^T$ where $Z_1, \ldots, Z_d \overset{iid}{\sim} N(0, 1)$. This is the same as generating one observation $\mathbf{z}$ from $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$, since a diagonal matrix $\mathbf{I}$ corresponds to mutually independent variables $Z_1, \ldots, Z_d$. Then we obtain a simulated sample for $\boldsymbol{\varepsilon}_{RE}$ by the transformation $\boldsymbol{\varepsilon}_{RE} = \mathbf{C}^T\mathbf{Z}$, because then $\boldsymbol{\varepsilon}_{RE} \sim N(\mathbf{C}^T\mathbf{0} + \mathbf{0},\, \mathbf{C}^T\mathbf{I}(\mathbf{C}^T)^T) = N(\mathbf{0}, \mathbf{C}^T\mathbf{C}) = N(\mathbf{0}, \mathbf{V})$. We can similarly generate a random sample for $\boldsymbol{\lambda}_{RE} \sim N(\mathbf{0}, \sigma_{RE}^2\mathbf{I})$: first generate iid standard normal observations $\mathbf{W} \sim N(\mathbf{0}, \mathbf{I})$ and set $\boldsymbol{\lambda}_{RE} = \sigma_{RE}\mathbf{W}$, which corresponds to a generated sample from $\boldsymbol{\lambda}_{RE} \sim N(\mathbf{0}, \sigma_{RE}^2\mathbf{I})$. A simulated sample from the random effects model is then $\mu\mathbf{1} + \boldsymbol{\lambda}_{RE} + \boldsymbol{\varepsilon}_{RE}$.
Then we have the location-scale model, given that we know $\mu$ and $\sigma_{LS}$. A simulated sample for $\boldsymbol{\varepsilon}_{LS} \sim N(\mathbf{0}, \sigma_{LS}^2\mathbf{V})$ can be generated by first generating iid samples $Z_1, \ldots, Z_d \overset{iid}{\sim} N(0, 1)$, since this corresponds to an observation from $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$. Then we make the transformation $\boldsymbol{\varepsilon}_{LS} = \sigma_{LS}\mathbf{C}^T\mathbf{Z}$, because then $\boldsymbol{\varepsilon}_{LS} \sim N(\sigma_{LS}\mathbf{C}^T\mathbf{0},\, \sigma_{LS}^2\mathbf{C}^T\mathbf{I}(\mathbf{C}^T)^T) = N(\mathbf{0}, \sigma_{LS}^2\mathbf{V})$. A simulated sample from the location-scale model is then $\mu\mathbf{1} + \boldsymbol{\varepsilon}_{LS}$.
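The two recipes can be sketched as follows (Python with NumPy; function names are ours). Note that `np.linalg.cholesky` returns the lower-triangular factor $\mathbf{C}$ with $\mathbf{V} = \mathbf{C}\mathbf{C}^T$, i.e. the transpose of the upper-triangular factor returned by R's `chol`, so the roles of $\mathbf{C}$ and $\mathbf{C}^T$ are swapped relative to the text:

```python
import numpy as np

def simulate_lsm(mu, sigma_ls, V, rng):
    """One draw x = mu*1 + eps_LS with eps_LS ~ N(0, sigma_LS^2 V)."""
    n = V.shape[0]
    C = np.linalg.cholesky(V)      # lower triangular, V = C C^T
    return mu + sigma_ls * (C @ rng.standard_normal(n))

def simulate_rem(mu, sigma_re, V, rng):
    """One draw x = mu*1 + lambda_RE + eps_RE with
    eps_RE ~ N(0, V) and lambda_RE ~ N(0, sigma_RE^2 I)."""
    n = V.shape[0]
    C = np.linalg.cholesky(V)
    return mu + sigma_re * rng.standard_normal(n) + C @ rng.standard_normal(n)
```

Averaging the sample covariance of many such draws and comparing it with the target covariance matrix is a simple way to validate the transformations.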
7.6 F: Symbols and clarifications
X is a random variable.
x is a scalar, a realization of X.
If $\mathbf{A}$ is some matrix, then $\mathbf{A}^T$ is its transpose.
X = [X1, X2, . . . , Xn]T.
x = [x1, x2, . . . , xn]T.
1 = [1, 1, . . . , 1]T.
$\mathbf{I}$ is the identity matrix.
$$\mathbf{V} = \begin{pmatrix} u_1^2 & \rho_{12}u_1u_2 & \cdots & \rho_{1n}u_1u_n \\ \rho_{21}u_2u_1 & u_2^2 & \cdots & \rho_{2n}u_2u_n \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1}u_nu_1 & \rho_{n2}u_nu_2 & \cdots & u_n^2 \end{pmatrix} = \mathbf{U}^{1/2}\,\boldsymbol{\Omega}\,\mathbf{U}^{1/2}\,,$$
where $\mathbf{U}^{1/2} = \mathrm{diag}(u_1, u_2, \ldots, u_n)$ and $\boldsymbol{\Omega}$ is the correlation matrix with unit diagonal and off-diagonal elements $\rho_{ij}$.
$\boldsymbol{\varepsilon}_{LS}$ represents the heterogeneity in the location-scale model and is an $(n \times 1)$ column random vector.
$\boldsymbol{\varepsilon}_{RE}$ represents the variation in $\mathbf{X}$ dependent only on $\mathbf{V}$ and is an $(n \times 1)$ column random vector.
$\boldsymbol{\lambda}_{RE}$ represents the heterogeneity in the random effects model and is an $(n \times 1)$ column random vector.
$\mu$ represents the parameter of interest and is a scalar random variable.
$\sigma_{LS}$ and $\sigma_{RE}$ represent the unexplainable interlaboratory heterogeneity in the location-scale model and the random effects model, respectively. They are both scalars and are considered nuisance parameters.
$\theta$ is an arbitrary parameter, scalar or vector.
Θ is the parameter space of θ. π(θ) is a prior distribution for θ.
πN(θ) is a noninformative prior distribution for θ and we assume it to
be improper unless otherwise stated.
For example, a multivariate normal distribution $\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\mathbf{X}$ has dimension $d$, has the probability density function
$$f(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(\boldsymbol{\Sigma})}}\exp\left(-\frac{(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}{2}\right).$$
The probability density function of $\boldsymbol{\varepsilon}_{LS} \sim N(\mathbf{0}, \sigma_{LS}^2\mathbf{V})$ is then
$$f(\boldsymbol{\varepsilon}_{LS}|\mathbf{0}, \sigma_{LS}^2\mathbf{V}) = \frac{1}{(2\pi)^{n/2}\,\sigma_{LS}^n\sqrt{\det(\mathbf{V})}}\exp\left(-\frac{\boldsymbol{\varepsilon}_{LS}^T\mathbf{V}^{-1}\boldsymbol{\varepsilon}_{LS}}{2\sigma_{LS}^2}\right).$$
7.7 G: Data on the Newtonian constant of gravitation

The source of the data on the gravitational constant G is [Merkatas et al., 2019], and we express the estimates and uncertainties of G as below:
Study          Estimate (G/10^-11 m^3 kg^-1 s^-2)   Uncertainty
NIST-82        6.67248                              0.00043
TR&D-96        6.6729                               0.00050
LANL-97        6.67398                              0.00070
UWash-00       6.674255                             0.000092
BIPM-01        6.67559                              0.00027
UWup-02        6.67422                              0.00098
MSL-03         6.67387                              0.00027
HUST-05        6.67222                              0.00087
UZur-06        6.67425                              0.00012
HUST-09        6.67349                              0.00018
JILA-10        6.67260                              0.00025
BIPM-14        6.67554                              0.00016
LENS-14        6.67191                              0.00099
UCI-14         6.67435                              0.00013
HUST-TOS-18    6.674184                             0.000078
HUST-AAF-18    6.674484                             0.000078
7.8 H: Program code in R
################################################################################
# Data + some definitions
################################################################################
X_gravity <- c(6.672480,6.672900,6.673980,6.674255,6.675590,6.674220,6.673870,
               6.672220,6.674250,6.673490,6.672600,6.675540,6.671910,6.674350,
               6.674184,6.674484)   ### observations
u <- c(4.3e-04,5.0e-04,7.0e-04,9.2e-05,2.7e-04,9.8e-04,2.7e-04,8.7e-04,
       1.2e-04,1.8e-04,2.5e-04,1.6e-04,9.9e-04,1.3e-04,7.8e-05,7.8e-05) ### uncertainties
n<-length(u)          ### number of observations
bone<-rep(1,n)        ### vector of 1:s
In<-diag(n)           ### identity matrix
Jn<-matrix(1,n,n)     ### matrix of 1:s = bone%*%t(bone)
V<-diag(n)            ### the correlation matrix. Now: diagonal
u2<-u^2
U<-diag(u)%*%V%*%diag(u)        ### the covariance matrix U^{1/2} Omega U^{1/2}
C_REM_n<-(2*pi)^(-0.5*(n-1))    ### constant of normalization, REM
C_LSM_n<-(pi)^(-0.5*(n-1))      ### constant of normalization, LSM
################# numerical integration, definition of nodes and weights
N<-100                ### number of nodes
h <- (pi/2)/(N-1)     ### interval length
y<-seq(0,pi/2,h)      # Nodes ### to be 'x-values' (= sigma_RE)
tan_y<-tan(y[2:N])    ### inserted into tan formula
cos_y2<-cos(y[2:N])^2 ### inserted into cos^2 formula
w <- rep(1,N)         ### Simpson weights 1,4,2,4,...,2,4,1
w[seq(2,N-1,2)] <- 4
w[seq(3,N-2,2)] <- 2
w <- w*h/3 # Weights
######################################################## training samples
l<-2                  ## two observations are enough for propriety
bone_l<-rep(1,l)      ### vector of 1:s for training sample
Il<-diag(l)           ### identity matrix for training sample
Jl<-matrix(1,l,l)     ### matrix of 1:s for training sample
set_l<-combn(1:n, l)  ### all possible combinations of sets with two elements
length_l<-ncol(set_l) ### number of combinations
C_REM_l<-(2*pi)^(-0.5*(l-1))  ### constant of normalization for training sample, REM
C_LSM_l<-(pi)^(-0.5*(l-1))    ### constant of normalization for training sample, LSM
################################################################################
# Simulation, Cholesky decomposition
################################################################################
## general constants
mu<-0                 ### value for mu
sigma_LS<-1.5         ### value for sigma_{LSM}
sigma_RE<-0.02        ### value for sigma_{REM}
cholU<-t(chol(U))     ### square root of U based on Cholesky decomposition;
                      ### t() is used since R returns the Cholesky root as an
                      ### upper triangular matrix

# Simulating a sample from the LSM #
X_sim_LS<- rep(mu,n)+sigma_LS*cholU%*%rnorm(n)

# Simulating a sample from the REM #
X_sim_RE<- rep(mu,n)+sigma_RE*rnorm(n)+cholU%*%rnorm(n)