Bayesian Model Selection with Intrinsic Bayes Factor for Location-Scale Model and Random Effects Model
Author:
Viktor Eriksson
(Date of Birth – 911001)
Semester: 4th semester in master’s program
Course name, level and credits: Independent Project II, Advanced, 15 Credits
Subject: Statistics
Supervisor: Olha Bodnar
Examiner: Hoang Nguyen
Abstract
In this project the concept of Bayesian model selection has been introduced for noninformative priors. In particular, the intrinsic Bayes factor (IBF) is used for the comparison of two models: the location-scale model and the random effects model. These two models are the most widely used in meta-analysis and for the adjustment of fundamental physical constants such as the Planck constant and the gravitational constant. The IBF, defined as a ratio of posterior predictive distributions, has been derived for both models. In the case of the location-scale model an analytical expression for the posterior predictive distribution is obtained, while in the case of the random effects model the posterior predictive distribution is presented as a one-dimensional integral only. This integral is computed numerically using Simpson's rule. The theoretical findings are applied to real-life data.
Contents
1 Introduction
  1.1 The background
  1.2 The models
  1.3 A Bayesian approach
  1.4 Bayesian model selection
2 Theory of Bayesian Model Selection
  2.1 The Bayes Factor
  2.2 The Intrinsic Bayes Factor
3 Objective Bayesian Analysis in Location-Scale and Random Effects Model
  3.1 The Location-Scale Model
  3.2 The Random Effects Model
  3.3 Location-Scale Model Vs. Random Effects Model
4 Numerical illustration
5 Empirical applications
  5.1 Newtonian constant of gravitation
6 Discussion
7 Appendices
  7.1 A: Frequentist statistics Vs. Bayesian statistics
  7.2 B: The necessity for a proper prior distribution
  7.3 C: Training samples in the IBF
  7.4 D: The numerical integration
  7.5 E: The simulation
  7.6 F: Symbols and clarifications
  7.7 G: Data on the Newtonian constant of gravitation
1 Introduction
In this introduction we start by explaining the background and purpose of the two models that we aim to select between. Afterwards we consider the models themselves, namely the location-scale model and the random effects model. Then we consider a Bayesian treatment of these two models and the objective Bayesian approach to model selection. We end by discussing the Bayes factor. The aim of the thesis is to employ a method for objective Bayesian model selection, namely the intrinsic Bayes factor. This method is then used to make a selection between the location-scale model and the random effects model. There exists no software for an automatic computation of the intrinsic Bayes factor. We code the computational algorithms ourselves in R (given in appendix G) and derive the needed mathematical closed forms in this thesis.
1.1 The background
In metrology and in clinical studies there exist parameters of such importance that a multitude of studies have been made to estimate them. For a physicist such an important parameter could be a natural constant, such as the Newtonian gravitational constant or the Planck constant (see [Mohr et al, 2016] for a compilation of multiple measurements of these physical constants). In medicine it could be a log odds ratio relating patients who suffer from carotid artery stenosis to patients who suffered a stroke (see [Meier et al, 2010] for a compilation of multiple such studies). These are all examples of important parameters for which multiple individual studies have been made for parameter estimation.
Each individual study reports an estimate of the parameter together with its uncertainty. In physics and medicine a parameter is estimated by taking measurements in a laboratory. Great care is taken in laboratories to determine the measurement margin of error such that it accounts for every identifiable source of error. We can therefore assume that an individual study estimate is internally consistent, i.e., its uncertainty takes into account every identifiable (hence explainable) source of uncertainty.
Although the individual studies are internally consistent, they often fail to be externally consistent. When the estimates together with their uncertainties from all individual studies are brought together in a scatterplot, it is usually revealed that the dispersion of the estimates is substantially greater than what the reported uncertainties suggest. The reported uncertainties take into account every identifiable source of error in each individual study, yet they fail to explain the full variation between the estimates satisfactorily. The extra variation is caused by external inconsistency, i.e., inconsistency between the laboratories, or interlaboratory inconsistency. We label this extra variation the heterogeneity. The heterogeneity could also be referred to as the excess variance or as the dark uncertainty [Thompson & Ellison, 2011]. It is important to note that the heterogeneity is unexplainable and cannot be resolved easily.
It is customary to combine the estimates from all individual studies into a pooled consensus value, because if we let each new study replace the estimate from the previous study, we would be wasting valuable data on the parameter. The undertaking to form a consensus value from individual studies is called an interlaboratory study. Meta-analysis is an often used method for this purpose. Its objective is to combine these estimates into a consensus value, here denoted by µ, representative of all individual studies taken together. The reported uncertainties of the individual studies are collected into a covariance matrix, here denoted by V. The diagonal of V contains the variances u_i^2 of the estimates, and the off-diagonal entries contain the covariances ρ_ij u_i u_j, where ρ_ij is the correlation between the ith and the jth study (see appendix F for an explanation of the various symbols used in this thesis). In a meta-analysis it is important that the heterogeneity is accounted for when forming the consensus value, or else the consensus value will be biased. But how can an unexplained heterogeneity be accounted for?
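As a small illustration of this construction, the following sketch (Python with NumPy; the uncertainties u_i and correlations ρ_ij are hypothetical numbers, not taken from any study) assembles the covariance matrix V from reported uncertainties and correlations:

```python
import numpy as np

# Hypothetical reported standard uncertainties u_i of three studies
u = np.array([0.10, 0.15, 0.08])

# Hypothetical correlations rho_ij between the studies (unit diagonal)
R = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.0]])

# V has the variances u_i^2 on the diagonal and the covariances
# rho_ij * u_i * u_j on the off-diagonal
V = R * np.outer(u, u)
```

For uncorrelated studies (ρ_ij = 0) the matrix V reduces to the diagonal matrix diag(u_1^2, ..., u_n^2).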
The ending question of the last paragraph leads us to the two models of interest in this thesis, namely the location-scale model and the random effects model. Both of these models are used in a meta-analysis to form a consensus value while accounting for an unexplainable heterogeneity. In both models we let the heterogeneity be denoted by σ, although we index this parameter according to the model: σ_LS for the location-scale model and σ_RE for the random effects model.
1.2 The models
Under the location-scale model the data generating process is described by X = µ1 + ε_LS, where ε_LS ∼ N(0, σ_LS^2 V), and 1 and 0 are column vectors of ones and zeros, respectively. This model is generally favored by physicists in parameter estimation [Koepke et al, 2017, p.40]. It is used to adjust for inconsistent data in [Weise & Wöger, 2000] and in [Mohr et al, 2008], and is related to the Birge ratio, see [Birge, 1932].
Under the random effects model the data generating process is described by X = µ1 + λ_RE + ε_RE, where λ_RE ∼ N(0, σ_RE^2 I) and ε_RE ∼ N(0, V). This model is generally favored by medical researchers in parameter estimation [Koepke et al, 2017, p.40]. It has been used in chemistry for interlaboratory comparisons in [Mandel & Paule, 1970] and in [Toman et al, 2012]. Other references on the application of the random effects model can be found in [Bodnar et al, 2016, p.47]. In [Bodnar, Link & Elster, 2016] an objective Bayesian inference is proposed for the generalized random effects model; the reference prior distribution and the posterior distribution of µ are derived and then used to estimate the Planck constant.
In [Bodnar et al, 2016] both the location-scale model and the random effects model have been tested for robustness to model misspecification by computing coverage intervals. The test computes the Bayesian credible interval for the parameter µ by employing a simulated random sample. The experiment generates a sample and then computes the credible interval under the respective model; this is repeated 50000 times, and the coverage probability is defined as the success rate at which the credible interval covers µ. The results show that the random effects model is more robust to model misspecification than the location-scale model: when the location-scale model is true, but the random effects model is used to estimate µ, the coverage probability is roughly 95%, which is the same coverage probability as when the random effects model itself is true. However, when the random effects model is true, but the location-scale model is used to estimate µ, the coverage probability is only roughly 45%, while when the location-scale model is true the coverage probability is 95%. If the location-scale model is actually true, then the random effects model will still do a good job at estimating µ. On the other hand, if the random effects model is actually true, then the location-scale model will make a bad estimate of µ. Note however that both models have good coverage probabilities (≈ 95%) given that the used model is the true model.
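The flavor of such a coverage experiment can be mimicked on a much smaller scale. The sketch below is a simplified frequentist stand-in for the Bayesian intervals of [Bodnar et al, 2016] (all numbers are hypothetical): data are generated from the random effects model, but the interval for µ is formed as if V alone described the dispersion, which illustrates the undercoverage discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_re = 0.0, 0.5          # hypothetical true values
u = np.full(5, 0.1)                   # hypothetical reported uncertainties
V = np.diag(u ** 2)
w = np.linalg.solve(V, np.ones(5))    # weights V^{-1} 1
half = 1.96 / np.sqrt(w.sum())        # naive 95% half-width ignoring sigma_re

hits, n_rep = 0, 2000
for _ in range(n_rep):
    # data from the random effects model: X = mu 1 + lambda + eps
    x = mu_true + rng.normal(0, sigma_re, 5) + rng.normal(0, u)
    mu_hat = w @ x / w.sum()          # weighted mean that ignores sigma_re
    hits += abs(mu_hat - mu_true) <= half
coverage = hits / n_rep               # far below the nominal 0.95
```

With σ_RE much larger than the reported u_i, the empirical coverage falls far short of the nominal 95%, mirroring the misspecification effect reported in the cited study.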
1.3 A Bayesian approach
The location-scale model and the random effects model are treated within a Bayesian framework, and therefore the unknown parameters of the models, (µ, σ_LS) and (µ, σ_RE), are assumed to be random variables described by a probability density function (see appendix A for a brief discussion of the differences between frequentist statistics and Bayesian statistics). Both models have in common the known positive definite matrix V (which is constant) and the unknown parameter µ. The parameter µ is the target parameter, while the parameters σ_LS and σ_RE are nuisance parameters representing heterogeneity for the respective model.
Consider the frequentist viewpoint of parameter estimation and compare it to the Bayesian viewpoint. Their main differences regard the interpretation of the parameter µ and the uncertainty of the estimate of µ [Koepke et al, 2017, p.44]. In the frequentist perspective, the parameter µ is an unknown constant. The frequentist evaluates the estimate of µ within a repeatability context: under imagined repetitions of the experiment in which data are collected and the estimate of µ is computed, the frequentist evaluates how the estimate of µ fluctuates from one sample to another. Hence the frequentist viewpoint of uncertainty is based on the (objective) sampling variability. In the Bayesian perspective µ is unknown, but we may formulate our prior knowledge of µ in a prior probability density function for µ. Then µ is viewed as a random variable, and the prior distribution can be used together with the observed data to infer the value of µ. The Bayesian views data as a means to update the prior distribution into a posterior distribution with a reduced dispersion of plausible outcomes for µ. The variation in µ is based on the prior knowledge and the observed data, not on repeatability. If the prior distribution of µ is subjective, then the estimate and uncertainty of µ in the posterior distribution are partly subjective as well. The Bayesian approach to parameter estimation can however be made objective with a noninformative prior distribution.
1.4 Bayesian model selection
There are various methods for objective Bayesian model selection; however, each method usually has some limitation or difficulty, because it either requires that certain conditions are satisfied for it to work well or is difficult to implement in general. For example, the Bayes factor by the conventional prior approach can be difficult to implement, and there is no general method to derive the conventional priors [Berger & Pericchi, 2001, p.148]. The Bayesian information criterion (BIC) can be used as an approximation to the Bayes factor for large sample sizes; however, this approximation is beset with difficulties for models with irregular asymptotics [Berger & Pericchi, 2001, p.153].
The intrinsic Bayes factor is our method of choice in this project, and we properly present it in section 2.2. The benefit of the intrinsic Bayes factor is that it is automatic (objective) in the sense that it depends only on the observed data and a noninformative prior distribution. It may however be computationally intensive and unstable [Berger & Pericchi, 1996, p.120-121]. The intrinsic Bayes factor makes use of a minimal training sample for parameter estimation. To obtain a stable result it is conventional to take the average or median of the intrinsic Bayes factors computed over all possible training samples. The fractional Bayes factor offers an alternative way of Bayesian model selection when noninformative priors are employed (see [O'Hagan, 1995, p.115]).
So why use a Bayesian approach to model selection? Some good arguments are stated in [Berger & Pericchi, 2001, p.138-140] and are as follows. The Bayes factor is interpreted as an odds factor, which is easy to understand. It is an automatic Occam's razor, since it favors a simple model over a complex model. The Bayes factor is a consistent method, which means that if any of the compared models is true, then the true model is selected (given that some mild conditions are satisfied). The Bayes factor does not require models to be nested (one being a special case of the other), which makes it possible to compare different types of models in which the parameters may vary in dimension. Some of the difficulties with the Bayes factor are also stated in [Berger & Pericchi, 2001, p.142]. For example, the Bayes factor can be difficult to compute. The Bayes factor returns indeterminate answers when we have different dimensions of the parameter spaces in conjunction with improper prior distributions. The Bayes factor does not usually return good answers when we use vague proper priors; we should instead use improper noninformative priors. The same parameter can have different interpretations in different models, and so the prior distribution for this parameter must be specified differently in different models. This may be counterintuitive when we are formulating subjective priors.
The influence of the prior distribution on the Bayes factor is greater than its influence on the estimation of a parameter. This is because the weight of the prior distribution in the Bayes factor is not reduced with an increasing sample size [Berger & Pericchi, 1996]. The Bayes factor is therefore particularly vulnerable to misspecifications of the prior distribution, and for this reason we look for an objective noninformative prior distribution. [Berger, Bernardo & Sun, 2009] developed the so-called Berger and Bernardo reference prior, which is an objective noninformative prior distribution and will be used in this thesis. The Berger and Bernardo reference prior is recommended for use in the intrinsic Bayes factor for large sample sizes [Berger & Pericchi, 1996, p.121].
2 Theory of Bayesian Model Selection
2.1 The Bayes Factor
In this subsection we introduce the Bayes factor, its interpretation, and its use for model comparison. Let M_1, ..., M_i, ..., M_q denote the models that we aim to compare. Under model M_i the data X follow the probability density function f_i(x|θ_i) with θ_i ∈ Θ_i, which we also refer to as the likelihood function. In the Bayesian approach to model selection we start by assigning to each model M_i a prior probability P(M_i) of the model M_i being correct and a prior distribution π_i(θ_i) of the parameter θ_i. Then the posterior probability P(M_i|x) of the model M_i being correct when the sample x is observed is given by

\[ P(M_i|x) = \left( \sum_{j=1}^{q} \frac{P(M_j)}{P(M_i)}\, B_{ji}(x) \right)^{-1}, \quad \text{where} \quad B_{ji}(x) = \frac{m_j(x)}{m_i(x)} = \frac{\int f_j(x|\theta_j)\,\pi_j(\theta_j)\,d\theta_j}{\int f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i} \]

is the Bayes factor between M_j and M_i.
The posterior probability P(M_i|x) is however dependent on the subjectively chosen prior probabilities P(M_i), i = 1, 2, ..., q, of the model M_i being correct. If one prefers the Bayesian model selection procedure to be objective, then one shall assign equal prior probabilities to the considered models, that is P(M_1) = ... = P(M_q) = 1/q. In this case the posterior probability P(M_i|x) used for the selection of a model depends on the Bayes factors B_ji(x) only. In the special case q = 2, the Bayes factor alone can be used to make an objective Bayesian model selection.
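Under equal prior probabilities the posterior model probabilities reduce to normalized marginal likelihoods, since B_ji = m_j(x)/m_i(x). A minimal sketch of this reduction (the marginal likelihood values below are hypothetical):

```python
import numpy as np

def posterior_model_probs(marginals):
    """Posterior model probabilities P(M_i|x) under equal prior
    probabilities P(M_i) = 1/q.  Since B_ji = m_j/m_i, the formula
    P(M_i|x) = (sum_j B_ji)^(-1) reduces to m_i / sum_j m_j."""
    m = np.asarray(marginals, dtype=float)
    return m / m.sum()

# Hypothetical marginal likelihoods m_1(x) and m_2(x), so B_12 = 3
probs = posterior_model_probs([0.03, 0.01])
```

Here B_12 = 3, so P(M_1|x) = 3/(3 + 1) = 0.75 and P(M_2|x) = 0.25.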
The Bayes factor is interpreted as the odds in favor of M_j against M_i based on the evidence in the data x. The inequality B_ji > 1 is supportive of model M_j, while B_ji < 1 is interpreted as evidence against M_j. Bayesian model selection by the Bayes factor can be understood as a generalization of Bayesian hypothesis testing. We should note that the prior distributions π_i(θ_i) of the parameters θ_i are subjective because they are decided prior to the data, but they are objective if we let them be noninformative.
The marginal distribution of the data under model M_i is given by

\[ m_i(x) = \int_{\theta_i \in \Theta_i} f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i . \]
It measures the probability of observing x under model M_i. As a rule of thumb, we can say that a high probability of observing x under model M_i corresponds to a good fit of the model to the data. This rule can easily be abused, however, since for example a model can be introduced to overfit the data. The overfitted model then has a very high value of m_i(x), yet it predicts new values with a high uncertainty. We remark that only models which can be supported by the underlying theory of the application should be compared, and not arbitrary functional forms chosen for the sake of maximizing m_i(x).
The Bayes factor can also be interpreted as an odds factor if the marginal distribution m_i(x) is proper, such that ∫_{x∈Ω} m_i(x) dx = 1 for all i = 1, ..., q. This requirement may seem strange, given that m_i(x) as a probability density function should always integrate to unity. However, this may not be the case when an improper prior distribution, denoted by π_i^N(θ_i), is used (see appendix B for a detailed explanation). Such a situation is usually present when the Bayes factor is computed by using a noninformative prior, i.e., when objective Bayesian inference is preferable, since most noninformative prior distributions are improper. An improper prior distribution π_i^N(θ_i) cannot have a normalizing constant and therefore it is not a true probability density function. If we multiply π_i^N(θ_i) by an unspecified constant c_i and determine c_i by normalizing π_i^N(θ_i), then we get

\[ \int_{\theta_i \in \Theta_i} c_i\, \pi_i^N(\theta_i)\,d\theta_i = 1 \iff c_i = \frac{1}{\int_{\theta_i \in \Theta_i} \pi_i^N(\theta_i)\,d\theta_i} = \frac{1}{\infty} = 0 . \]
The result c_i = 0 means from a practical point of view that there exists no normalizing constant for an improper prior distribution π_i^N(θ_i). Since an improper prior distribution does not possess a normalizing constant, it is only meaningful to know the distribution up to a constant of proportionality. This means that the improper prior distribution can be written as π_i^N(θ_i) ∝ h_i(θ_i). One could however include some proportionality constant c_i and write π_i^N(θ_i) = c_i h_i(θ_i), although c_i does not normalize π_i^N(θ_i).
Consider computing the Bayes factor for two models M_j and M_i where j ≠ i. Model M_j has a likelihood function f_j(x|θ_j) and an improper prior distribution π_j^N(θ_j) = c_j h_j(θ_j). Model M_i has a likelihood function f_i(x|θ_i) and a proper prior distribution π_i(θ_i). Then the Bayes factor is given by

\[ B_{ji}(x) = c_j\, \frac{\int f_j(x|\theta_j)\, h_j(\theta_j)\,d\theta_j}{\int f_i(x|\theta_i)\,\pi_i(\theta_i)\,d\theta_i} , \]

and it depends on an arbitrarily chosen value of c_j. This is clearly a problem, since any value of c_j can be picked by the statistician, which would greatly alter the support of a model given by the data x. If model M_i also has an improper prior distribution π_i^N(θ_i) = c_i h_i(θ_i), then we have

\[ B_{ji}(x) = \frac{c_j}{c_i}\, \frac{\int f_j(x|\theta_j)\, h_j(\theta_j)\,d\theta_j}{\int f_i(x|\theta_i)\, h_i(\theta_i)\,d\theta_i} , \]
which depends on the ratio of two arbitrarily chosen constants. This is a critical disadvantage of the application of the Bayes factor to model selection, as the following citation from a textbook on Bayesian data analysis shows: "This fully Bayesian approach [Bayes factors] has some appeal but we generally do not recommend it because, in practice, the marginal likelihood is highly sensitive to aspects of the model that are typically assigned arbitrarily and are untestable from data." [Gelman et al, 2014, p. 182].
2.2 The Intrinsic Bayes Factor
The intrinsic Bayes factor (IBF) is a solution to the problem that improper prior distributions impose on the Bayes factor, namely that under an improper prior distribution the Bayes factor has no interpretation as an odds factor. The IBF differs from the ordinary Bayes factor in that it makes use of a training sample. The training sample is employed for the purpose of transforming the improper prior distribution into a proper posterior distribution, which is then used as a prior for the remaining elements of the sample. The IBF is then interpretable as an odds factor.
The training sample x_ℓ is a subset of x, x_ℓ ⊂ x, of size m. Let x_(ℓ) = x − x_ℓ be the data set with the training sample removed. Then x_ℓ and x_(ℓ) form a partition of x. We have the following definition:
Definition 1. [Berger & Pericchi, 2001, p.149] "A training sample, x_ℓ, is called proper if 0 < m_i^N(x_ℓ) < ∞ for all M_i, and minimal if it is proper and no subset is proper", where m_i^N(x_ℓ) denotes the marginal distribution of x_ℓ computed under the noninformative prior π_i^N(θ_i).
Why does a training sample make an improper prior distribution proper? An explanation is that the training sample enables the improper prior distribution to better describe which values of θ_i are the more probable. Then the cumulative distribution function (CDF) of π_i^N(θ_i|x_ℓ) converges for all θ_i ∈ Θ_i and the posterior distribution π_i^N(θ_i|x_ℓ) is proper.
To understand this, recall that the impropriety of a prior distribution follows from it being noninformative. When we use a noninformative prior distribution we admit minimal knowledge of the true value of θ_i. For this reason such priors are made to be vague or 'flat' in distribution. But then the probability mass becomes too evenly distributed across the range of θ_i for the cumulative distribution function of π_i^N(θ_i) to converge for all θ_i ∈ Θ_i (see appendix C for examples and further explanation). Oppositely, as the information about θ_i increases, the tails of the distribution diminish in size. When we have more information about θ_i, the probability mass accumulates in the vicinity of the true value of θ_i. The distribution becomes more mound-shaped rather than flat with increasing information about θ_i. With small enough tails in the distribution, the CDF of π_i^N(θ_i|x_ℓ) will converge for all θ_i ∈ Θ_i.
The purpose of a noninformative prior distribution is for the Bayesian approach to model selection to become objective. If we need the prior distribution to be both informative and objective, we have to use observed data for parameter estimation, replacing π_i^N(θ_i) by π_i^N(θ_i|x_ℓ). But then we lose data points to the specification of π_i^N(θ_i|x_ℓ) that could have been used for computing the Bayes factor. We are sacrificing a portion of the data set x, namely the training sample x_ℓ, for parameter estimation instead of using it for model selection. We want as much data as possible for the computation of the Bayes factor, since the overall goal is to make a comparison between two models. This is why we only want the minimal training sample.
The minimal training sample is usually of the same size as the number of parameters in the model. But there are exceptions to this rule; if for example the prior distribution is proper for some parameters, then the minimal training sample size may be less than the number of parameters in the model [Berger & Pericchi, 1996]. As an example, suppose we have a joint prior distribution of two parameters that is improper, π(µ, σ^2) ∝ 1/σ^2; then two observations from the data set should suffice as a training sample. Which two observations should we pick for a training sample? There are n(n − 1)/2 ways to select two observations from x. We should use every combination to compute the IBF, which implies that we must make n(n − 1)/2 computations of the IBF, one for each such combination. We then draw conclusions based on a summary of the differently computed IBFs, such as the average IBF or the median IBF.
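The loop over all n(n − 1)/2 pairs and the summarizing step can be sketched generically. In the sketch below the IBF itself is passed in as a function; the dummy `ibf_fn` used at the end is a placeholder to exercise the enumeration, not the location-scale or random effects IBF derived later in this thesis.

```python
import numpy as np
from itertools import combinations

def summarize_ibf(x, ibf_fn):
    """Evaluate ibf_fn(x_rest, x_l) for every minimal training sample
    x_l = {x_i, x_j} and return the median and mean over all pairs."""
    n = len(x)
    vals = []
    for i, j in combinations(range(n), 2):   # all n(n-1)/2 pairs
        keep = np.ones(n, dtype=bool)
        keep[[i, j]] = False                  # x_(l): data minus the pair
        vals.append(ibf_fn(x[keep], x[[i, j]]))
    return np.median(vals), np.mean(vals), len(vals)

x = np.arange(6.0)
# placeholder "IBF" just to exercise the enumeration over pairs
med, avg, n_pairs = summarize_ibf(x, lambda x_rest, x_l: x_l.sum())
```

With n = 6 the function visits 6 · 5 / 2 = 15 training samples; plugging in the true IBF for a model pair yields the average and median IBFs discussed above.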
Let the IBF be denoted by B_ji^I(x_(ℓ)|x_ℓ), since it depends on x_(ℓ) and is conditioned on the training sample x_ℓ.

Definition 2. The intrinsic Bayes factor is

\[ B_{ji}^I(x_{(\ell)}|x_\ell) = \frac{m_j^N(x_{(\ell)}|x_\ell)}{m_i^N(x_{(\ell)}|x_\ell)} = \frac{\int f_j(x_{(\ell)}|\theta_j, x_\ell)\,\pi_j^N(\theta_j|x_\ell)\,d\theta_j}{\int f_i(x_{(\ell)}|\theta_i, x_\ell)\,\pi_i^N(\theta_i|x_\ell)\,d\theta_i} . \]
There is an easier way to compute the IBF than using the definition. We may instead develop the IBF as follows:

\[ B_{ji}^I(x_{(\ell)}|x_\ell) = \frac{m_j^N(x_{(\ell)}|x_\ell)}{m_i^N(x_{(\ell)}|x_\ell)} = \frac{m_j^N(x_{(\ell)}, x_\ell)/m_j^N(x_\ell)}{m_i^N(x_{(\ell)}, x_\ell)/m_i^N(x_\ell)} = \frac{m_j^N(x)/m_j^N(x_\ell)}{m_i^N(x)/m_i^N(x_\ell)} = \frac{m_j^N(x)}{m_i^N(x)} \times \frac{m_i^N(x_\ell)}{m_j^N(x_\ell)} = B_{ji}^N(x) \times B_{ij}^N(x_\ell) . \]
Recall that the original problem with Bayes factors was that of improper prior distributions. We used two models M_j and M_i that have likelihood functions f_j(x|θ_j) and f_i(x|θ_i), respectively, with improper prior distributions π_j^N(θ_j) = c_j h_j(θ_j) and π_i^N(θ_i) = c_i h_i(θ_i). When computing the IBF of M_j and M_i we have

\[ B_{ji}^I(x_{(\ell)}|x_\ell) = B_{ji}^N(x) \times B_{ij}^N(x_\ell) = \frac{c_j}{c_i}\, \frac{\int f_j(x|\theta_j)\, h_j(\theta_j)\,d\theta_j}{\int f_i(x|\theta_i)\, h_i(\theta_i)\,d\theta_i} \times \frac{c_i}{c_j}\, \frac{\int f_i(x_\ell|\theta_i)\, h_i(\theta_i)\,d\theta_i}{\int f_j(x_\ell|\theta_j)\, h_j(\theta_j)\,d\theta_j} . \]
The important point is that the arbitrary constants of proportionality cj and ci will cancel in the IBF [O’Hagan, 1995]. Hence the IBF does not
depend on anything arbitrary or subjective and can therefore be used for an objective Bayesian model selection.
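The cancellation can be checked numerically in a toy comparison that is not one of the models of this thesis: let M_1 fix X_k ∼ N(0, 1) with no free parameter, and let M_2 have X_k ∼ N(θ, 1) with the improper prior π^N(θ) = c. One observation is then a minimal training sample for M_2, and the IBF is unchanged when c is varied by many orders of magnitude:

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=5)   # hypothetical data

def m1(data):
    """Marginal under M1: X_k ~ N(0,1), no free parameter."""
    return np.prod(stats.norm.pdf(data))

def m2(data, c):
    """Marginal under M2: X_k ~ N(theta,1), improper prior pi(theta) = c."""
    f = lambda t: c * np.prod(stats.norm.pdf(data - t))
    val, _ = integrate.quad(f, -50.0, 50.0)
    return val

def ibf_21(data, train_idx, c):
    """B^I_21 = B^N_21(x) * B^N_12(x_l); the constant c cancels."""
    x_l = data[list(train_idx)]
    return (m2(data, c) / m1(data)) * (m1(x_l) / m2(x_l, c))

a = ibf_21(x, [0], 1.0)
b = ibf_21(x, [0], 1e6)   # a million-fold change in c
```

The two evaluations agree to numerical precision, illustrating that the IBF is free of the unspecified constant.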
We summarize this subsection by saying that a noninformative prior distribution is used for an objective approach to Bayesian model selection. The IBF is a method in which the noninformative prior distribution is made proper by the use of a minimal training sample. The improper π_i^N(θ_i) is then replaced by the proper π_i^N(θ_i|x_ℓ), and the rest of the data x_(ℓ) is employed for model selection, resulting in the IBF. The IBF measures the odds in favor of M_j against M_i based on the evidence in the data, without the arbitrariness of the unspecified constants of improper priors.
3 Objective Bayesian Analysis in Location-Scale and Random Effects Model
In this section we establish all theoretical results that are needed to compute the IBF for the location-scale model and the random effects model. We start by presenting the location-scale model by a definition, mention its use in statistics, and then formulate the theorems needed for the final calculation of the IBF. We treat the random effects model likewise and end this section by comparing the two models and mentioning some of the literature on the subject.
3.1 The Location-Scale Model
The location-scale model is related to the Birge method, a popular method in physics and metrology for adjusting uncertainties to adapt for external inconsistency. The Birge method is used under the assumption that, due to external inconsistency, i.e., heterogeneity, the reported uncertainty of each estimate is understated. The Birge method multiplies each uncertainty by a common factor, called the Birge ratio, which amounts to a sort of averaging of the uncertainties. Under the Birge ratio all the uncertainties are adjusted to be uniformly understated and hence mutually consistent. In [Bodnar & Elster, 2014] the Birge ratio is studied within a Bayesian framework in relation to the location-scale model.
Definition 3. Let V be a known positive definite matrix, µ an unknown location parameter and σ_LS an unknown scale parameter. Then the location-scale model M_LS of the data X is described by (i) and (ii).

(i) X = µ1 + ε_LS.

(ii) ε_LS ∼ N(0, σ_LS^2 V).
The components (i) and (ii) of the location-scale model can be combined to form X ∼ N(µ1, σ_LS^2 V). This means that X follows a multivariate normal distribution with mean µ1 and covariance matrix σ_LS^2 V. The positive definite matrix V contains all the reported uncertainties of the parameter µ, while σ_LS is an unknown factor that represents the unexplainable heterogeneity. The parameter µ is the target parameter in the context of an application, while σ_LS is a nuisance parameter included in the model to capture the heterogeneity. The idea is that the elements of V can be understated due to the presence of heterogeneity. By multiplying V by σ_LS^2, the elements of V become uniformly adjusted for the heterogeneity. By using the notation X|µ, σ_LS, M_LS we make it explicit that the data are conditioned on the two parameters µ and σ_LS under the location-scale model.
Definition 4. Under the location-scale model the data X follow the multivariate normal distribution N(µ1, σ_LS^2 V) with probability density function given by

\[ f(x|\mu, \sigma_{LS}, M_{LS}) = \frac{\det(V)^{-1/2}\,\sigma_{LS}^{-n}}{(2\pi)^{n/2}} \exp\left( -\frac{1}{2\sigma_{LS}^2}\, (x - \mu 1)^T V^{-1} (x - \mu 1) \right) . \]
For a full Bayesian treatment of the location-scale model we need the joint probability density function f(x, µ, σ_LS|M_LS) = f(x|µ, σ_LS, M_LS) π^N(µ, σ_LS), where π^N(µ, σ_LS) is the noninformative reference prior of the location-scale model. The Berger and Bernardo reference prior has been derived for the general location-scale model in [Fernández & Steel, 1999] and is given by

\[ \pi^N(\mu, \sigma_{LS}) \propto \frac{1}{\sigma_{LS}} . \]
In the derivation of the marginal distribution of X under the location-scale model we will use the following two lemmas.
Lemma 1. Let V be positive definite. Then it follows that

\[ (x - \mu 1)^T V^{-1} (x - \mu 1) = x^T Q x + 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 , \quad \text{where} \quad Q = V^{-1} - \frac{V^{-1} 1 1^T V^{-1}}{1^T V^{-1} 1} . \]
Proof. We first expand the left-hand side of the equation. It holds that

\[ (x - \mu 1)^T V^{-1} (x - \mu 1) = (x^T - \mu 1^T)(V^{-1} x - \mu V^{-1} 1) = x^T V^{-1} x - 2\mu\, 1^T V^{-1} x + \mu^2\, 1^T V^{-1} 1 . \]

Completing the square for µ leads to

\[ 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 + x^T V^{-1} x - \frac{(1^T V^{-1} x)^2}{1^T V^{-1} 1} . \]

Notice that (1^T V^{-1} x)^2 = x^T V^{-1} 1 1^T V^{-1} x, because 1^T V^{-1} x is a scalar and is therefore equal to its own transpose. Hence,

\[ 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 + x^T V^{-1} x - \frac{x^T V^{-1} 1 1^T V^{-1} x}{1^T V^{-1} 1} = 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 + x^T Q x . \]

The proof is complete.
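Lemma 1 is easy to sanity-check numerically; the sketch below evaluates both sides of the identity for a randomly generated positive definite V, data vector x and value of µ (all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)        # a positive definite matrix
Vi = np.linalg.inv(V)
x = rng.normal(size=n)
one = np.ones(n)
mu = 0.7

# Q = V^{-1} - V^{-1} 1 1^T V^{-1} / (1^T V^{-1} 1)
Q = Vi - np.outer(Vi @ one, one @ Vi) / (one @ Vi @ one)

lhs = (x - mu * one) @ Vi @ (x - mu * one)
center = (one @ Vi @ x) / (one @ Vi @ one)
rhs = x @ Q @ x + (one @ Vi @ one) * (mu - center) ** 2
```

The two sides agree to floating-point precision for any choice of µ.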
Lemma 2. Let n > 1 and let V be positive definite. Then it follows that

\[ \int_{-\infty}^{\infty} \left[ (x - \mu 1)^T V^{-1} (x - \mu 1) \right]^{-n/2} d\mu = (x^T Q x)^{-n/2}\, \frac{\Gamma\!\left(\frac{n-1}{2}\right) \sqrt{(n-1)\pi}\, \sqrt{\frac{1}{n-1}\,\frac{x^T Q x}{1^T V^{-1} 1}}}{\Gamma\!\left(\frac{n}{2}\right)} . \]
Proof. By Lemma 1 we may rewrite the integrand as follows:

\[ \left[ (x - \mu 1)^T V^{-1} (x - \mu 1) \right]^{-n/2} = \left[ x^T Q x + 1^T V^{-1} 1 \left( \mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1} \right)^2 \right]^{-n/2} = (x^T Q x)^{-n/2} \left[ 1 + \frac{1}{n-1} \left( \frac{\mu - \frac{1^T V^{-1} x}{1^T V^{-1} 1}}{\sqrt{\frac{1}{n-1}\,\frac{x^T Q x}{1^T V^{-1} 1}}} \right)^{2} \right]^{-\frac{(n-1)+1}{2}} . \]

The integrand is proportional to the probability density function of a scaled and shifted t-distribution with n − 1 degrees of freedom, location parameter 1^T V^{-1} x / 1^T V^{-1} 1 and scale parameter √((1/(n−1)) x^T Q x / 1^T V^{-1} 1). Using the fact that the definite integral of a density function over its support equals unity, the statement of Lemma 2 follows.
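Lemma 2 can likewise be verified by comparing the closed form with a numerical quadrature of the left-hand side (the V and x below are hypothetical):

```python
import numpy as np
from scipy import integrate
from scipy.special import gamma

rng = np.random.default_rng(3)
n = 3
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)        # positive definite
Vi = np.linalg.inv(V)
x = rng.normal(size=n)
one = np.ones(n)
Q = Vi - np.outer(Vi @ one, one @ Vi) / (one @ Vi @ one)
xQx, b = x @ Q @ x, one @ Vi @ one   # b = 1^T V^{-1} 1

# left-hand side of Lemma 2, integrated numerically over mu
f = lambda mu: ((x - mu * one) @ Vi @ (x - mu * one)) ** (-n / 2)
numeric, _ = integrate.quad(f, -np.inf, np.inf)

# right-hand side of Lemma 2
closed = (xQx ** (-n / 2) * gamma((n - 1) / 2) * np.sqrt((n - 1) * np.pi)
          * np.sqrt(xQx / ((n - 1) * b)) / gamma(n / 2))
```

The quadrature result matches the closed form to high precision.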
Theorem 1. Let n > 1 and let V be positive definite. Then under the location-scale model and reference prior π^N(µ, σ_LS) = 1/σ_LS, the marginal distribution of X is given by

\[ m(x|M_{LS}) = \frac{\Gamma\!\left(\frac{n-1}{2}\right) \left( x^T Q x \right)^{-\frac{n-1}{2}}}{\det(V)^{1/2}\, 2\, \pi^{\frac{n-1}{2}}\, \sqrt{1^T V^{-1} 1}} . \]
Proof. The marginal distribution of the data x is obtained by integrating out the parameters µ and σ_LS from the joint probability density function f(x, µ, σ_LS|M_LS). It holds that

\[ m(x|M_{LS}) = \int_0^\infty \int_{-\infty}^{\infty} f(x|\mu, \sigma_{LS}, M_{LS})\, \pi^N(\mu, \sigma_{LS})\, d\mu\, d\sigma_{LS} = \int_0^\infty \int_{-\infty}^{\infty} \frac{\det(V)^{-1/2}}{(2\pi)^{n/2}}\, \sigma_{LS}^{-n} \exp\left( -\frac{(x-\mu 1)^T V^{-1} (x-\mu 1)}{2\sigma_{LS}^2} \right) \frac{1}{\sigma_{LS}}\, d\mu\, d\sigma_{LS} . \]

Substituting ζ = σ_LS^2, so that dσ_LS = dζ/(2√ζ), with ζ = 0 when σ_LS = 0 and ζ = ∞ when σ_LS = ∞, gives

\[ m(x|M_{LS}) = \frac{\det(V)^{-1/2}}{2(2\pi)^{n/2}} \int_{-\infty}^{\infty} \left( \int_0^\infty \zeta^{-(\frac{n}{2}+1)} \exp\left( -\frac{(x-\mu 1)^T V^{-1} (x-\mu 1)}{2\,\zeta} \right) d\zeta \right) d\mu . \]

The inner integral with respect to ζ has an integrand that is the kernel of the density of the inv-gamma(n/2, (x−µ1)^T V^{-1}(x−µ1)/2) distribution. Hence,

\[ m(x|M_{LS}) = \frac{\det(V)^{-1/2}}{2(2\pi)^{n/2}} \int_{-\infty}^{\infty} \frac{\Gamma\!\left(\frac{n}{2}\right)}{\left( \frac{(x-\mu 1)^T V^{-1} (x-\mu 1)}{2} \right)^{n/2}}\, d\mu = \frac{\det(V)^{-1/2}\, \Gamma\!\left(\frac{n}{2}\right)}{2\, \pi^{n/2}} \int_{-\infty}^{\infty} \left[ (x - \mu 1)^T V^{-1} (x - \mu 1) \right]^{-n/2} d\mu . \]
By lemma 2 the proof is finished.
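The closed-form marginal of theorem 1 is straightforward to evaluate numerically. The sketch below is a Python illustration (the helper name is ours; NumPy and SciPy assumed), working on the log scale for numerical stability:

```python
import numpy as np
from scipy.special import gammaln

def marginal_lsm(x, V):
    """Closed-form marginal m(x | M_LS) of theorem 1:
    Gamma((n-1)/2) (x'Qx)^{-(n-1)/2} / (det(V)^{1/2} 2 pi^{(n-1)/2} sqrt(1'V^{-1}1))."""
    n = len(x)
    ones = np.ones(n)
    Vinv = np.linalg.inv(V)
    s = ones @ Vinv @ ones                              # 1' V^{-1} 1
    Q = Vinv - np.outer(Vinv @ ones, ones @ Vinv) / s   # Q of lemma 1
    xQx = x @ Q @ x
    log_m = (gammaln((n - 1) / 2)
             - 0.5 * (n - 1) * np.log(xQx)
             - 0.5 * np.linalg.slogdet(V)[1]
             - np.log(2.0)
             - 0.5 * (n - 1) * np.log(np.pi)
             - 0.5 * np.log(s))
    return np.exp(log_m)
```

For $n = 2$ the expression reduces to the minimal-training-sample marginal of theorem 2, which gives a convenient cross-check of an implementation.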
Theorem 2. Let $\mathbf{V}$ be positive definite. Under the location-scale model and reference prior $\pi^N(\mu, \sigma_{LS}) = 1/\sigma_{LS}$, the size of a minimal training sample $\mathbf{x}_\ell$ is given by $m = 2$, i.e., $\mathbf{x}_\ell = \{x_i, x_j\}$, where $x_i, x_j \in \mathbf{x}$ and $i \neq j$. Moreover, it holds that
$$m(\mathbf{x}_\ell|M_{LS}) = \frac{1}{2\sqrt{\det(\mathbf{V}_\ell)\,(\mathbf{x}_\ell^T\mathbf{Q}_\ell\mathbf{x}_\ell)\,(\mathbf{1}_2^T\mathbf{V}_\ell^{-1}\mathbf{1}_2)}}\,,$$
where $\mathbf{V}_\ell$ is obtained from $\mathbf{V}$ by taking its elements lying on the intersections of the $\ell$ rows and $\ell$ columns, and
$$\mathbf{Q}_\ell = \mathbf{V}_\ell^{-1} - \frac{\mathbf{V}_\ell^{-1}\mathbf{1}_2\mathbf{1}_2^T\mathbf{V}_\ell^{-1}}{\mathbf{1}_2^T\mathbf{V}_\ell^{-1}\mathbf{1}_2} \quad \text{with } \mathbf{1}_2 = (1,1)^T.$$
Proof. We must show that if $m = 1$ then $m(\mathbf{x}_\ell|M_{LS}) = \infty$, while if $m = 2$ then $0 < m(\mathbf{x}_\ell|M_{LS}) < \infty$. Suppose that $m = 1$; then we use only one observation $x_i \in \mathbf{x}$ as a training sample. It holds that
$$m(x_i|M_{LS}) = \int_0^\infty\int_{-\infty}^\infty f(x_i|\mu,\sigma_{LS},M_{LS})\,\pi^N(\mu,\sigma_{LS})\,d\mu\,d\sigma_{LS} = \int_0^\infty\int_{-\infty}^\infty \frac{u_i^{-1}}{(2\pi)^{1/2}\sigma_{LS}}\exp\left(-\frac{(x_i-\mu)^2/u_i^2}{2\sigma_{LS}^2}\right)\frac{1}{\sigma_{LS}}\,d\mu\,d\sigma_{LS} = \frac{u_i^{-1}}{(2\pi)^{1/2}}\int_0^\infty \sigma_{LS}^{-2}\left(\int_{-\infty}^\infty \exp\left(-\frac{(x_i-\mu)^2}{2u_i^2\sigma_{LS}^2}\right)d\mu\right)d\sigma_{LS}\,.$$
The inner integral with respect to $\mu$ has an integrand that is the kernel of the $N(x_i, u_i^2\sigma_{LS}^2)$ probability density function. Hence,
$$m(x_i|M_{LS}) = \frac{u_i^{-1}}{(2\pi)^{1/2}}\int_0^\infty \sigma_{LS}^{-2}\,(2\pi)^{1/2}u_i\sigma_{LS}\,d\sigma_{LS} = \int_0^\infty \frac{1}{\sigma_{LS}}\,d\sigma_{LS} = \lim_{M\to\infty}\lim_{N\to 0}\Big[\ln \sigma_{LS}\Big]_N^M = \infty - (-\infty) = \infty\,.$$
Since $m(x_i|M_{LS})$ diverges, one observation cannot be used as a minimal training sample. Suppose instead that $m = 2$; then theorem 1 shows that $m(x_i, x_j|M_{LS})$ converges and equals
$$m(\mathbf{x}_\ell|M_{LS}) = \frac{1}{2\sqrt{\det(\mathbf{V}_\ell)\,(\mathbf{x}_\ell^T\mathbf{Q}_\ell\mathbf{x}_\ell)\,(\mathbf{1}_2^T\mathbf{V}_\ell^{-1}\mathbf{1}_2)}}\,.$$
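Theorem 2 can be coded directly. A small Python sketch (function name is ours; NumPy assumed) evaluates $m(\mathbf{x}_\ell|M_{LS})$ for the minimal training sample $\{x_i, x_j\}$:

```python
import numpy as np
from itertools import combinations

def marginal_lsm_pair(x, V, i, j):
    """m(x_l | M_LS) for the minimal training sample {x_i, x_j} (theorem 2):
    1 / (2 sqrt(det(V_l) (x_l' Q_l x_l) (1' V_l^{-1} 1)))."""
    idx = [i, j]
    Vl = V[np.ix_(idx, idx)]      # 2x2 submatrix of V
    xl = x[idx]
    ones2 = np.ones(2)
    Vinv = np.linalg.inv(Vl)
    s = ones2 @ Vinv @ ones2
    Ql = Vinv - np.outer(Vinv @ ones2, ones2 @ Vinv) / s
    return 1.0 / (2.0 * np.sqrt(np.linalg.det(Vl) * (xl @ Ql @ xl) * s))

# all n(n-1)/2 minimal training samples:
# values = [marginal_lsm_pair(x, V, i, j) for i, j in combinations(range(len(x)), 2)]
```

For $\mathbf{V} = \mathbf{I}$ the formula collapses to $1/(2|x_i - x_j|)$, a handy sanity check.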
3.2 The Random Effects Model
Definition 5. Let $\mathbf{V}$ be a known positive definite matrix, $\mu$ an unknown location parameter and $\sigma_{RE}$ an unknown scale parameter. Then the random effects model $M_{RE}$ of the data $\mathbf{X}$ is described by (j), (jj) and (jjj).

(j) $\mathbf{X} = \mu\mathbf{1} + \boldsymbol{\lambda}_{RE} + \boldsymbol{\varepsilon}_{RE}$.

(jj) $\boldsymbol{\varepsilon}_{RE} \sim N(\mathbf{0}, \mathbf{V})$.

(jjj) $\boldsymbol{\lambda}_{RE} \sim N(\mathbf{0}, \sigma_{RE}^2\mathbf{I})$.
The components (j), (jj) and (jjj) of the random effects model can be combined to form the multivariate normal distribution $\mathbf{X} \sim N(\mu\mathbf{1}, \mathbf{V}+\sigma_{RE}^2\mathbf{I})$. Within the context of application the quoted uncertainties in $\mathbf{V}$ are understated because of the presence of heterogeneity. In the random effects model an adjustment for this heterogeneity is made to $\mathbf{V}$ by adding a common term $\sigma_{RE}^2$ to every element in the diagonal of $\mathbf{V}$, where $\sigma_{RE}^2$ represents the heterogeneity. Since only the diagonal is adjusted by $\sigma_{RE}^2$, only the variances $u_i^2$ are corrected for the heterogeneity, while the covariances in $\mathbf{V}$ are left unaltered. By using the notation $\mathbf{X}|\mu, \sigma_{RE}, M_{RE}$ we make it explicit that the data are conditioned on the two parameters $\mu$ and $\sigma_{RE}$ under the random effects model.
Definition 6. Under the random effects model the data $\mathbf{X}$ follow the multivariate normal distribution $N(\mu\mathbf{1}, \mathbf{V}+\sigma_{RE}^2\mathbf{I})$ with probability density function
$$f(\mathbf{x}|\mu,\sigma_{RE},M_{RE}) = \frac{\det(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1/2}}{(2\pi)^{n/2}}\exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu\mathbf{1})^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}(\mathbf{x}-\mu\mathbf{1})\right).$$
The Berger & Bernardo reference prior has been derived in [Bodnar, Link & Elster, 2016] and is given by
$$\pi^N(\mu, \sigma_{RE}) \propto \sqrt{\sigma_{RE}^2 \cdot \mathrm{tr}\!\left((\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-2}\right)}\,.$$
Theorem 3. Let $n > 1$ and $\mathbf{V}$ be positive definite. Then under the random effects model and reference prior
$$\pi^N(\mu, \sigma_{RE}) = \pi^N(\sigma_{RE}) = \sqrt{\sigma_{RE}^2 \cdot \mathrm{tr}\!\left((\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-2}\right)},$$
the marginal distribution of $\mathbf{X}$ is given by
$$m(\mathbf{x}|M_{RE}) = \int_0^\infty \frac{(2\pi)^{-\frac{n-1}{2}}\det(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1/2}}{\sqrt{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}}\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\sigma_{RE}^2)\mathbf{x}\right)\pi^N(\sigma_{RE})\,d\sigma_{RE}\,,$$
where
$$\mathbf{Q}(\sigma_{RE}^2) = (\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1} - \frac{(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}}{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}\,.$$
Proof. The marginal distribution of the data $\mathbf{X}$ is obtained by integrating out the parameters $\mu$ and $\sigma_{RE}$ in the joint probability density function $f(\mathbf{x}, \mu, \sigma_{RE}|M_{RE})$:
$$m(\mathbf{x}|M_{RE}) = \int_0^\infty\int_{-\infty}^\infty f(\mathbf{x}|\mu,\sigma_{RE},M_{RE})\,\pi^N(\sigma_{RE})\,d\mu\,d\sigma_{RE} = \int_0^\infty\int_{-\infty}^\infty \frac{\det(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1/2}}{(2\pi)^{n/2}}\exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu\mathbf{1})^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}(\mathbf{x}-\mu\mathbf{1})\right)\pi^N(\sigma_{RE})\,d\mu\,d\sigma_{RE}\,.$$
From lemma 1 with $\mathbf{V}+\sigma_{RE}^2\mathbf{I}$ substituted for $\mathbf{V}$, we can rewrite the exponential as
$$\exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu\mathbf{1})^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}(\mathbf{x}-\mu\mathbf{1})\right) = \exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\sigma_{RE}^2)\mathbf{x}\right)\exp\left(-\tfrac{1}{2}\,\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}\left(\mu - \frac{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{x}}{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}\right)^2\right).$$
The second factor is proportional to the kernel of the $N(\omega, \tau^2)$ probability density function with location parameter $\omega = \mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{x}\,\big/\,\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}$ and variance $\tau^2 = \left(\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}\right)^{-1}$. Integrating the exponential with respect to $\mu$ over its range yields
$$\int_{-\infty}^\infty \exp\left(-\tfrac{1}{2}(\mathbf{x}-\mu\mathbf{1})^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}(\mathbf{x}-\mu\mathbf{1})\right)d\mu = \frac{\sqrt{2\pi}\,\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\sigma_{RE}^2)\mathbf{x}\right)}{\sqrt{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}}\,.$$
The proof is complete.
In appendix B we show that the marginal distribution $m(\mathbf{x})$ can only be proper if the prior distribution $\pi^N(\theta)$ is proper. Hence, if we can show that $\pi^N(\mu, \sigma_{RE}|\mathbf{x}_\ell)$ is proper for some training sample $\mathbf{x}_\ell$ of size $m > 1$, then by appendix B we know that $m(\mathbf{x}_\ell|M_{RE})$ is proper and the minimal training sample must be of size $m = 2$. It has been proven in [Bodnar, Link & Elster, 2016, p.32] that for the random effects model and the Berger and Bernardo reference prior, if $n > 1$ then $\pi^N(\mu, \sigma_{RE}|\mathbf{x})$ is proper. We can therefore state a theorem about the minimal training sample without proof, since it has already been implicitly proven in the cited article.
Theorem 4. Let $\mathbf{V}$ be positive definite. Under the random effects model and reference prior
$$\pi^N(\mu, \sigma_{RE}) = \pi^N(\sigma_{RE}) = \sqrt{\sigma_{RE}^2 \cdot \mathrm{tr}\!\left((\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-2}\right)},$$
the size of a minimal training sample $\mathbf{x}_\ell$ is given by $m = 2$, i.e., $\mathbf{x}_\ell = \{x_i, x_j\}$, where $x_i, x_j \in \mathbf{x}$ and $i \neq j$. Moreover, it holds that
$$m(\mathbf{x}_\ell|M_{RE}) = \int_0^\infty \frac{(2\pi)^{-\frac{1}{2}}\det(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1/2}}{\sqrt{\mathbf{1}_2^T(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}_2}}\exp\left(-\tfrac{1}{2}\mathbf{x}_\ell^T\mathbf{Q}_\ell(\sigma_{RE}^2)\mathbf{x}_\ell\right)\pi^N(\sigma_{RE})\,d\sigma_{RE}\,,$$
where $\mathbf{V}_\ell$ is obtained from $\mathbf{V}$ by taking its elements lying on the intersections of the $\ell$ rows and $\ell$ columns, and
$$\mathbf{Q}_\ell(\sigma_{RE}^2) = (\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1} - \frac{(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}_2\mathbf{1}_2^T(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1}}{\mathbf{1}_2^T(\mathbf{V}_\ell+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}_2} \quad \text{with } \mathbf{1}_2 = (1,1)^T.$$
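The one-dimensional integral of theorem 4 is easy to evaluate with standard quadrature. A Python sketch (our own helper; `scipy.integrate.quad` stands in here for the Simpson rule used in appendix D):

```python
import numpy as np
from scipy.integrate import quad

def marginal_rem_pair(x, V, i, j):
    """m(x_l | M_RE) of theorem 4: one-dimensional integral over sigma_RE
    with the reference prior sqrt(sigma^2 * tr((V_l + sigma^2 I)^{-2}))."""
    idx = [i, j]
    Vl = V[np.ix_(idx, idx)]
    xl = x[idx]
    ones2 = np.ones(2)
    I2 = np.eye(2)

    def integrand(sigma):
        A = Vl + sigma**2 * I2
        W = np.linalg.inv(A)
        s = ones2 @ W @ ones2
        Q = W - np.outer(W @ ones2, ones2 @ W) / s
        prior = np.sqrt(sigma**2 * np.trace(W @ W))
        return ((2.0 * np.pi) ** -0.5 * np.linalg.det(A) ** -0.5
                / np.sqrt(s) * np.exp(-0.5 * xl @ Q @ xl) * prior)

    val, _ = quad(integrand, 0.0, np.inf)
    return val
```

The same structure, with $(2\pi)^{-(n-1)/2}$ and the full matrix $\mathbf{V}$, gives $m(\mathbf{x}|M_{RE})$ of theorem 3.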
3.3 Location-Scale Model Vs. Random Effects Model
Both the location-scale model and the random effects model imply that the data $\mathbf{X}$ follow a multivariate normal distribution, that is $\mathbf{X} \sim N(\mu\mathbf{1}, \sigma_{LS}^2\mathbf{V})$ and $\mathbf{X} \sim N(\mu\mathbf{1}, \mathbf{V}+\sigma_{RE}^2\mathbf{I})$, respectively. The difference between them is how they account for the heterogeneity. The choice stands between modeling the heterogeneity as a multiplicative correction factor (location-scale model) or as an additive correction term (random effects model).
We make a selection between the location-scale model and the random effects model according to which model is favored by the IBF. The IBF for the two models is constructed in the following way:
$$B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell) = B^N_{M_{RE}M_{LS}}(\mathbf{x}) \times B^N_{M_{LS}M_{RE}}(\mathbf{x}_\ell) = \frac{m(\mathbf{x}|M_{RE})}{m(\mathbf{x}|M_{LS})} \times \frac{m(\mathbf{x}_\ell|M_{LS})}{m(\mathbf{x}_\ell|M_{RE})}\,.$$
We have derived the marginal distributions of the location-scale model in closed form in theorem 1 and theorem 2. However, the marginal distributions of the random effects model are only partly derived in theorem 3 and theorem 4 and need further evaluation by numerical calculation. This numerical calculation is only a minor issue, since we can calculate $m(\mathbf{x}|M_{RE})$ and $m(\mathbf{x}_\ell|M_{RE})$ relatively fast and with good accuracy using Simpson's rule, as explained in appendix D.
Both the location-scale model and the random effects model have been shown to have a minimal training sample size of $m = 2$. With a total sample size of $n$, there are $n(n-1)/2$ ways to select two observations out of $n$. Let $X_\ell = (\mathbf{x}_\ell^1, \mathbf{x}_\ell^2, \ldots, \mathbf{x}_\ell^{l})$ be the set of training samples; there are $l = n(n-1)/2$ elements in $X_\ell$. We will compute the IBF for every such possible selection and summarize the findings in an average IBF and a median IBF. The reason is that the mean and median of the IBF over $X_\ell$ have increased stability in comparison with the IBF of an arbitrary $\mathbf{x}_\ell^i$. If the sample size $n$ is either too small (too slight a stability improvement) or too large (too long a computing time), it is possible to use the expected mean discussed in [Berger & Pericchi, 1996, p.113-114], although we do not use it here. Let the average IBF be defined by
$$\mathrm{Mean}\,B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell) = B^N_{M_{RE}M_{LS}}(\mathbf{x}) \times l^{-1}\sum_{i=1}^{l} B^N_{M_{LS}M_{RE}}(\mathbf{x}_\ell^i) = \frac{m(\mathbf{x}|M_{RE})}{m(\mathbf{x}|M_{LS})} \times l^{-1}\sum_{i=1}^{l}\frac{m(\mathbf{x}_\ell^i|M_{LS})}{m(\mathbf{x}_\ell^i|M_{RE})}\,,$$
while the median IBF is given by
$$\mathrm{Median}\left(B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell)\right) = B^N_{M_{RE}M_{LS}}(\mathbf{x}) \times \mathrm{Median}\left(B^N_{M_{LS}M_{RE}}(\mathbf{x}_\ell^i)\right) = \frac{m(\mathbf{x}|M_{RE})}{m(\mathbf{x}|M_{LS})} \times \mathrm{Median}\left(\frac{m(\mathbf{x}_\ell^i|M_{LS})}{m(\mathbf{x}_\ell^i|M_{RE})}\right).$$
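Once the marginal values are available, the averaging step is mechanical. A minimal Python sketch (function name and inputs are ours; the marginals are assumed to have been computed beforehand):

```python
import numpy as np

def intrinsic_bayes_factors(m_full_re, m_full_ls, m_pair_ls, m_pair_re):
    """Mean and median IBF of M_RE versus M_LS.

    m_full_re, m_full_ls : marginals m(x|M_RE) and m(x|M_LS) of the full sample
    m_pair_ls, m_pair_re : sequences of marginals m(x_l^i|M_LS) and m(x_l^i|M_RE),
                           one entry per minimal training sample (l = n(n-1)/2)."""
    bn = m_full_re / m_full_ls                              # B^N_{M_RE M_LS}(x)
    corrections = np.asarray(m_pair_ls) / np.asarray(m_pair_re)
    return bn * corrections.mean(), bn * np.median(corrections)
```

The median variant simply replaces the arithmetic mean of the training-sample correction factors by their median, which is what makes it robust to outlying pairs.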
4 Numerical illustration
In this section we test the performance of the IBF with simulated samples from the location-scale model and the random effects model, respectively. We draw a sample from the location-scale model $\mathbf{X} = \mu\mathbf{1} + \boldsymbol{\varepsilon}_{LS}$, where $\boldsymbol{\varepsilon}_{LS} \sim N(\mathbf{0}, \sigma_{LS}^2\mathbf{V})$, and then we compute the IBF with this sample. By doing so we test whether the IBF is consistent, i.e., whether the IBF correctly identifies the true model. Then we draw a sample from the random effects model $\mathbf{X} = \mu\mathbf{1} + \boldsymbol{\lambda}_{RE} + \boldsymbol{\varepsilon}_{RE}$, where $\boldsymbol{\varepsilon}_{RE} \sim N(\mathbf{0}, \mathbf{V})$ and $\boldsymbol{\lambda}_{RE} \sim N(\mathbf{0}, \sigma_{RE}^2\mathbf{I})$, and test the IBF the same way (see appendix E for the details of the simulation procedure).
For the simulation we let $\mathbf{V}$ be determined by the data for the gravitational constant (see appendix G). We assume $\mu = 0$ in both models without loss of generality and set $n = 16$. We let $\sigma_{LS} = 1.3$ and thus $\sigma_{LS}^2 = 1.69$, which means that for the location-scale model $\mathbf{X} \sim N(\mu\mathbf{1}, \sigma_{LS}^2\mathbf{V})$. The understated uncertainties in $\mathbf{V}$ are then enlarged by a factor of 1.69 to adjust for the heterogeneity. We let $\sigma_{RE} = 0.0004$ and thus $\sigma_{RE}^2 = 1.6 \times 10^{-7}$, so that the uncertainties in $\mathbf{V}$ are enlarged by adding $1.6 \times 10^{-7}$ to each variance in the diagonal of $\mathbf{V}$. Thus $\sigma_{LS} = 1.3$ and $\sigma_{RE} = 0.0004$ cover roughly the same range of variation in $\mathbf{X}$.

L. S. Model    Mean $B^I_{M_{RE}M_{LS}}$    Median $B^I_{M_{RE}M_{LS}}$
No.1           0.9178607                    0.1063782
No.2           1.936334                     0.1658802
No.3           0.04904022                   0.01851217
No.4           3.100906                     0.6824315
No.5           0.003090198                  0.001818678
Table 1: The IBF for five different simulated samples from the location-scale model. We put µ = 0, σLS = 1.3 and V contains reported uncertainties for
the Newtonian constant of gravitation G.
R. E. Model    Mean $B^I_{M_{RE}M_{LS}}$    Median $B^I_{M_{RE}M_{LS}}$
No.1           85.16479                     8.664935
No.2           356.2071                     186.6131
No.3           355.8205                     128.8503
No.4           66.82962                     34.00741
No.5           24249.18                     9591.932
Table 2: The IBF for five different simulated samples from the random effects model. We put µ = 0, σRE = 0.0004 and V contains reported uncertainties
for the Newtonian constant of gravitation G.
We expect the IBFs in table 1 to be less than one, since this corresponds to support for the location-scale model when it is true. We see in table 1 that the median IBF is much more accurate at identifying the location-scale model. The reason is the presence of outliers, which cause the mean IBF to be much larger, sometimes even larger than one. Out of 120 unique minimal training samples only a few very large outliers were present, while in most other cases the IBF was far below one. In table 2 we expect the IBFs to be larger than one, since this corresponds to support for the random effects model when it is true. All five trials gave consistent results; they are, however, somewhat unstable, as in table 1.
5 Empirical applications

5.1 Newtonian constant of gravitation
In the year 1687 Newton published his famous Principia, in which he presented his thesis on the law of general gravitation. Newton proposed that the force that causes an apple to fall from a tree is the same kind of force that keeps the moon in orbit around the earth: the force of gravitational attraction. In most physics textbooks this law is written as the equation $F = G\frac{m_1 m_2}{r^2}$. The constant of proportionality, G, is called the general constant of gravitation and is a natural constant. That means that this constant is the same throughout the whole universe. It determines the strength of the gravitational force, given mass and distance. Imagine two objects placed in space far away from any stars and planets. If each object has a mass of 1 kg and they are put at a distance of 1 m from each other, then the objects exert a gravitational force on each other of exactly $F = G \approx 0.000\,000\,000\,066\,7\ \mathrm{N} = 6.67 \cdot 10^{-11}\ \mathrm{N}$. Compare this to the weight of a bottle of water of 1 kg, which is 9.8 N. The weight of a feather, 1 gram, corresponds to 0.0098 N.
Figure 1: The gravitational constant G. The red diamonds are the estimates and the blue intervals are the uncertainties; data from [Merkatas et al., 2019].
The average and median IBFs for the data on the gravitational constant are given by
$$\mathrm{Mean}\left(B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell)\right) = 0.8673046 \quad \text{and} \quad \mathrm{Median}\left(B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell)\right) = 0.4157501\,.$$

Figure 2: The IBF for each of the 120 minimal training samples. The y-axis displays $\log\left(B^I_{M_{RE}M_{LS}}(\mathbf{x}_{(\ell)}|\mathbf{x}_\ell)\right)$; negative values correspond to an IBF less than one, while positive values correspond to an IBF greater than one.
Both the average IBF and the median IBF clearly indicate support for the location-scale model in comparison to the random effects model. As seen in figure 2, except for a few outliers the data consistently indicate that the IBF is less than one and, hence, support the location-scale model.
6 Discussion
In view of the gravitational constant G data set, the IBF clearly favors the location-scale model over the random effects model. The mean IBF was only slightly under one and seems to be strongly influenced by outliers, since most individual IBFs were far under one. It is typical for averages to be sensitive to outliers. Since the spectrum of IBFs in figure 2 shows only a few outliers, we think that the median IBF represents the overall IBF better than the average IBF.
The IBF is the ratio of the marginal distributions of the respective models, and the marginal distribution is a measure of the fit of a model to the data. A model that fits the data well does not necessarily predict future outcomes well. When a model is overfitted to the data, the parameter uncertainties are large. It is our assumption that the compared models are not overfitted, and that therefore the model with the highest marginal distribution predicts future outcomes best. In the introduction we discussed the robustness of each model to model misspecification, and it was found that if the location-scale model was the wrong model, then it produced a poor estimate of $\mu$. If on the other hand the random effects model was the wrong model, it still persisted in making a good estimate of $\mu$ (95% coverage probability). This finding supports the random effects model, not the location-scale model that is favored by the IBF.
In reality, however, probably neither the location-scale model nor the random effects model is the actual 'true model'. Our results only show that the location-scale model is favored compared to the random effects model, but what about other models? Both the location-scale model and the random effects model are nevertheless good models for data inconsistency or heterogeneous data.
7 Appendices

7.1 A: Frequentist statistics Vs. Bayesian statistics
In frequentist statistics we specify a probability density function $f(\mathbf{x}|\theta)$ of the data $\mathbf{X}$ given the parameter $\theta$. The data $\mathbf{x}$ are known while the parameter $\theta$ is unknown and considered to be deterministic. Typically we want to estimate $\theta$ by some estimator $\hat\theta$ which is a function of the data, $\hat\theta = \hat\theta(\mathbf{x})$. The estimator $\hat\theta$ is a random variable which has a sampling distribution, an expectation $E(\hat\theta)$ and a variance $V(\hat\theta)$. It is important to note that in frequentist statistics probabilities are attached to the estimator $\hat\theta(\mathbf{x})$, not the parameter $\theta$. The frequentist confidence interval is said to cover the true value of $\theta$ with probability $100(1-\alpha)\%$, where $\alpha$ is the significance level. This $100(1-\alpha)\%$ probability has a repeatability interpretation: $100(1-\alpha)\%$ of the times that we collect data under identical conditions and compute the confidence interval, it will cover the true value of $\theta$. The confidence interval is an interval estimator, for example $\hat\theta \pm \lambda_\alpha\sqrt{\hat V(\hat\theta)}$, meaning that its boundaries are random. Confidence intervals are easily misunderstood. A typical mistake is to think that it is $\theta$ that lies inside the confidence interval with probability $100(1-\alpha)\%$, but this is false, since $\theta$ does not alter between different values; it is constant. The frequentist principle is, in quote: "In repeated practical use of a statistical procedure, the long run average actual accuracy should not be less than (and ideally should equal) the long run average reported accuracy" [Bayarri & Berger, 2004, p.60].
In Bayesian statistics we specify a probability density function $f(\mathbf{x}|\theta)$ of the data $\mathbf{x}$ and a prior probability density function $\pi(\theta)$ for the parameter $\theta$. The full probability distribution is then given by the joint distribution $f(\mathbf{x}, \theta) = f(\mathbf{x}|\theta)\pi(\theta)$. The data $\mathbf{x}$ are known and the parameter $\theta$ is unknown. The difference is that in Bayesian statistics the parameter $\theta$ is no longer considered a fixed constant as it is in frequentist statistics. In Bayesian statistics the parameter $\theta$ is a random variable with a probability density function $\pi(\theta)$. By using Bayes' theorem one can obtain the probability distribution of the parameter $\theta$ conditioned on the data $\mathbf{x}$, called the posterior distribution $p(\theta|\mathbf{x})$:
$$p(\theta|\mathbf{x}) = \frac{f(\mathbf{x},\theta)}{m(\mathbf{x})} = \frac{f(\mathbf{x}|\theta)\pi(\theta)}{\int f(\mathbf{x}|\theta)\pi(\theta)\,d\theta} \propto f(\mathbf{x}|\theta)\pi(\theta)\,.$$
The prior distribution $\pi(\theta)$ is specified before the sample $\mathbf{x}$ is collected, and it corresponds to the statistician's prior beliefs about which values are probable for $\theta$. The statistician's prior beliefs are not founded on observed data and are therefore subjective and likely to be biased. If the statistician wants to pick a prior distribution $\pi(\theta)$ objectively, he may choose a noninformative prior distribution. The principle behind noninformative prior distributions is that they should only make 'vague' probability statements about $\theta$. We can interpret this as saying that no value of $\theta$ should be favored more or less than any other value of $\theta$. The uniform distribution of $\theta$ satisfies this condition. It is however common that you want to transform your parameters by some function, and if you do, the uniformity is lost under the transformation. The so-called Jeffreys prior distribution is a solution to this problem. The Jeffreys prior is invariant to transformations of the parameter $\theta$ and is in this sense 'neutral'. Another kind of noninformative prior distribution is called the reference prior. The reference prior is determined so that the information gained about $\theta$ in observing one observation $\mathbf{x}$ is maximized. This corresponds to a minimally informative prior distribution for $\theta$, because if the prior distribution is very informative about $\theta$, then we learn little about $\theta$ in observing $\mathbf{x}$, since much information was already contained in the prior distribution. The Kullback-Leibler divergence and the maximization of the Shannon information are involved in the definition of the reference prior [Berger, Bernardo & Sun, 2009]. There is a strong relationship between the Shannon information and the concept of entropy, and an interesting perspective on the reference prior is that it can be thought of as having achieved the highest entropy (fragmentation and disorder) of all the prior distributions.
In Bayesian statistics we estimate $\theta$ by the estimator $\hat\theta = E_{p(\theta|\mathbf{x})}(\theta)$. Alternatively we may use the median or the mode of the posterior distribution $p(\theta|\mathbf{x})$ to estimate $\theta$. This means that in Bayesian statistics we use the posterior distribution to estimate $\theta$ and we consider $\theta$ to be a random variable. In frequentist statistics we would instead have used the sampling distribution of the estimator $\hat\theta(\mathbf{x})$ to estimate $\theta$, where $\theta$ is considered to be a nonrandom constant.

In Bayesian statistics we use the credible interval as an interval estimator of $\theta$, not confidence intervals. The credible interval is an interval such that $P(\theta_L < \theta < \theta_U|\mathbf{x}) = 1 - \alpha$; hence the credible interval is not uniquely defined. A common choice is to simply let the credible interval be described by $E_{p(\theta|\mathbf{x})}(\theta) \pm \lambda_\alpha\sqrt{V_{p(\theta|\mathbf{x})}(\theta)}$, in resemblance to the frequentist confidence interval. The credible interval has fixed boundaries; it is $\theta$ that is random and lies inside the credible interval with probability $100(1-\alpha)\%$. Interestingly, this is the interpretation that is commonly given to confidence intervals, which is entirely erroneous. However, the fact that it is common to mistake confidence intervals for credible intervals speaks in favor of credible intervals being more 'natural to the mind' than confidence intervals. Ultimately, frequentist statistics has historically been favored over Bayesian statistics because the former relies on fewer assumptions and is computationally easier than the latter. But with the advancement of computer devices and the introduction of objective Bayesian approaches, Bayesian statistics has become a school of thought to be reckoned with.
7.2 B: The necessity for a proper prior distribution
A prior distribution is said to be proper when $\int \pi(\theta_i)\,d\theta_i = 1$ (or, more generally, when $\int \pi(\theta_i)\,d\theta_i < \infty$). To see why prior distributions must be proper when using Bayes factors, consider the following:
$$\int_{\mathbf{x}\in\Omega} m_i(\mathbf{x})\,d\mathbf{x} = \int_{\mathbf{x}\in\Omega}\int_{\theta_i\in\Theta_i} f_i(\mathbf{x}|\theta_i)\,\pi_i(\theta_i)\,d\theta_i\,d\mathbf{x} = 1\,.$$
Since $f_i(\mathbf{x}|\theta_i)$ is a probability density function, it integrates to unity with respect to $\mathbf{x}$. Changing the order of integration,
$$\int_{\theta_i\in\Theta_i}\int_{\mathbf{x}\in\Omega} f_i(\mathbf{x}|\theta_i)\,\pi_i(\theta_i)\,d\mathbf{x}\,d\theta_i = \int_{\theta_i\in\Theta_i}\pi_i(\theta_i)\left(\int_{\mathbf{x}\in\Omega} f_i(\mathbf{x}|\theta_i)\,d\mathbf{x}\right)d\theta_i = \int_{\theta_i\in\Theta_i}\pi_i(\theta_i)\,d\theta_i = 1\,.$$
Hence the prior distribution must be proper if the marginal distribution is to integrate to unity.
7.3 C: Training samples in the IBF
In section 2.2 we discussed IBFs and minimal training samples. Here follows some discussion and examples to clarify the concepts presented there. Consider the following two definite integrals a) and b):
$$\text{a)}\quad \lim_{L\to\infty}\int_1^L \frac{1}{x}\,dx = \lim_{L\to\infty}\Big[\ln x\Big]_1^L = \infty\,, \qquad \text{b)}\quad \lim_{L\to\infty}\int_1^L \frac{1}{x^2}\,dx = \lim_{L\to\infty}\Big[-\frac{1}{x}\Big]_1^L = 1\,.$$
Why does the definite integral a), whose integrand is $1/x$, diverge, while b), whose integrand is $1/x^2$, converges? The reason is that the integrand of a) does not go to zero fast enough with increasing values of $x$. Viewing the integral as a sum, the cumulative contributions of $1/x$ for ever increasing $L$ are too large for the integral to be bounded above, and hence the integral diverges. The integrand $1/x^2$ in b), on the other hand, does go to zero fast enough with increasing values of $x$, and hence the integral converges to unity.
With this in mind, consider a case where $x_1,\ldots,x_n \overset{iid}{\sim} N(0,\sigma^2)$. Then the noninformative reference prior of $\sigma$ is $\pi^N(\sigma) \propto 1/\sigma$, where $\sigma \in (0,\infty)$. Let us check if the reference prior integrates to unity:
$$\int_0^\infty \pi^N(\sigma)\,d\sigma = \int_0^\infty \frac{1}{\sigma}\,d\sigma = \lim_{M\to\infty}\lim_{N\to 0}\Big[\ln\sigma\Big]_N^M = \infty - (-\infty) = \infty\,.$$
Now, let us use one observation $x_i$ from the data set as a training sample and see how the posterior prior distribution $\pi^N(\sigma|x_i)$ differs from $\pi^N(\sigma)$:
$$\pi^N(\sigma|x_i) \propto f(x_i|\sigma)\,\pi^N(\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{x_i^2}{2\sigma^2}\right)\frac{1}{\sigma} \propto \sigma^{-(1+1)}\exp\left(-\frac{x_i^2/2}{\sigma^2}\right).$$
After a suitable change of variables the final expression can be seen to be the kernel of the inv-gamma$(1/2,\,x_i^2/2)$ density. Substituting $\sigma^2 = \zeta$, so that $d\sigma = \frac{1}{2\sqrt{\zeta}}\,d\zeta$ with limits $\sigma = 0 \Leftrightarrow \zeta = 0$ and $\sigma = \infty \Leftrightarrow \zeta = \infty$, we obtain
$$\int_0^\infty \pi^N(\sigma|x_i)\,d\sigma = \frac{1}{\sqrt{2\pi}}\int_0^\infty \sigma^{-(1+1)}\exp\left(-\frac{x_i^2}{2\sigma^2}\right)d\sigma = \frac{1}{2\sqrt{2\pi}}\int_0^\infty \zeta^{-\left(\frac{1}{2}+1\right)}\exp\left(-\frac{x_i^2/2}{\zeta}\right)d\zeta = \frac{1}{2\sqrt{2\pi}}\,\frac{\Gamma\!\left(\frac{1}{2}\right)}{(x_i^2/2)^{1/2}} = \frac{1}{2x_i} < \infty\,.$$
Hence the noninformative prior distribution $\pi^N(\sigma)$ can be made proper by determining the posterior $\pi^N(\sigma|x_i)$ based on a single observation and the prior $\pi^N(\sigma)$. This posterior is then used as a prior for the other elements in the sample. Moreover, we get $\int_0^\infty c\,\pi^N(\sigma|x_i)\,d\sigma = 1$ with $c = 2x_i$.

The difference between the prior distribution $\pi^N(\sigma)$ and the posterior prior distribution $\pi^N(\sigma|x_i)$ is that the former is flatter than the latter. The prior distribution is flat in the sense that its probability mass is too spread out across the range of $\sigma$. The problem is then the same for $\pi^N(\sigma)$ as for the integrand of the definite integral in a). However, since the posterior prior distribution $\pi^N(\sigma|x_i)$ has enough of its probability mass gathered together, it integrates to a finite value.
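The computation above is easy to confirm numerically. A short Python sketch (our own helper; SciPy assumed) integrates the unnormalized posterior and compares with $1/(2x_i)$:

```python
import numpy as np
from scipy.integrate import quad

def posterior_mass(xi):
    """Integral over sigma in (0, inf) of the unnormalized posterior
    (1/sqrt(2 pi)) sigma^{-2} exp(-xi^2 / (2 sigma^2)); appendix C
    claims this equals 1 / (2 xi)."""
    integrand = lambda s: (2.0 * np.pi) ** -0.5 * s ** -2 * np.exp(-xi**2 / (2.0 * s**2))
    val, _ = quad(integrand, 0.0, np.inf)
    return val
```

Evaluating `posterior_mass(xi)` for a few values of `xi > 0` and comparing with `1 / (2 * xi)` confirms the closed-form result to quadrature accuracy.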
7.4 D: The numerical integration
Theorem 3 provides an analytically derived version of the marginal distribution for the random effects model up to a one-dimensional integral which depends on $\sigma_{RE}$. This integral needs to be treated numerically, which is the topic of this appendix. For simplicity of notation, let the integral from theorem 3 be denoted by $I = m(\mathbf{x}|M_{RE}) = \int_0^\infty f(\sigma_{RE})\,d\sigma_{RE}$ and the corresponding numerical approximation by $I_{\mathrm{Simpson}}$. We use the Simpson rule to compute $I$ numerically, since it attains higher accuracy than the Trapezoidal rule for the same number of nodes. The Simpson rule computes the integral of the function $f(\sigma_{RE})$ defined on the interval $\sigma_{RE} \in (a,b)$. It requires that the interval $(a,b)$ first be partitioned into $n$ subintervals of equal length $l$, $n$ even, such that $b - a = n \cdot l$. Let $\sigma_{RE}^i = a + i \cdot l$ and $f(\sigma_{RE}^i) = f_i$ for $i = 0, 1, \ldots, n$; then $I$ can be numerically approximated by the Simpson rule in the following way:
$$I_{\mathrm{Simpson}} = \frac{l}{3}\left(f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + \cdots + 2f_{n-2} + 4f_{n-1} + f_n\right).$$
The range $\sigma_{RE} \in (0,\infty)$ implies that $a = 0$ and $b = \infty$. We follow the method of [Bodnar, Muhumuza & Possolo, 2020] and make the transformation $\sigma_{RE} = \tan(\omega)$, which is a one-to-one map of the interval $\omega \in (0, \pi/2)$ onto $\sigma_{RE} \in (0,\infty)$. This allows us to compute the integral over the bounded range $\omega \in (0,\pi/2)$ instead of over the unbounded range $\sigma_{RE} \in (0,\infty)$. The Jacobian of this change of variable is $\left|\frac{d}{d\omega}\tan(\omega)\right| = 1/\cos^2(\omega)$. Hence
$$I = \int_0^\infty \frac{(2\pi)^{-\frac{n-1}{2}}\det(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1/2}}{\sqrt{\mathbf{1}^T(\mathbf{V}+\sigma_{RE}^2\mathbf{I})^{-1}\mathbf{1}}}\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\sigma_{RE}^2)\mathbf{x}\right)\pi^N(\sigma_{RE})\,d\sigma_{RE} = \int_0^{\pi/2} \frac{(2\pi)^{-\frac{n-1}{2}}\det(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1/2}}{\sqrt{\mathbf{1}^T(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1}\mathbf{1}}}\exp\left(-\tfrac{1}{2}\mathbf{x}^T\mathbf{Q}(\tan^2(\omega))\mathbf{x}\right)\frac{\pi^N(\tan(\omega))}{\cos^2(\omega)}\,d\omega\,,$$
where
$$\mathbf{Q}(\tan^2(\omega)) = (\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1} - \frac{(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1}\mathbf{1}\mathbf{1}^T(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1}}{\mathbf{1}^T(\mathbf{V}+\tan^2(\omega)\mathbf{I})^{-1}\mathbf{1}}\,.$$
A numerical calculation of an integral is not exact, and some error, however small, is bound to follow. It is common in mathematics to say that a numerical calculation is only an approximation to the corresponding analytical calculation. An approximation can easily be confused with an estimation and vice versa, but the word 'approximation' must not be taken as synonymous with the word 'estimation', because the two carry different associations. The main difference lies in the nature of the errors of the respective concepts. The error of an approximation specifies the exactness with which an integral has been calculated. Let $\varepsilon$ denote the error of an approximation; then $I = I_{\mathrm{Simpson}} \pm \varepsilon$, which implies that $I \in (I_{\mathrm{Simpson}} - \varepsilon, I_{\mathrm{Simpson}} + \varepsilon)$. In this sense it is said that the transcendental number $\pi$ is approximately 3.14, which means that $\pi$ is known up to two decimals and has an error of magnitude $\varepsilon \sim 10^{-3}$. On the other hand, the error of an estimation specifies a precision, which reflects an unbounded variability. In other words, an estimator can take on any value in its range regardless of its precision, although some values will be increasingly unlikely to be taken by the estimator as the precision increases.
The Simpson rule is implemented in appendix H in the R code for the computation of the IBF. For the gravitational constant we have convergence up to six decimals of the integral $I_{\mathrm{Simpson}}$ at $N = 50\,000$ intervals.
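The scheme of this appendix can be sketched compactly. The following Python implementation (ours, independent of the appendix-H R code) applies the composite Simpson rule after the $\sigma = \tan(\omega)$ substitution; the integrand is assumed to vanish as $\sigma \to \infty$, so the node at $\omega = \pi/2$ is set to zero:

```python
import numpy as np

def simpson_tan(f, n_intervals=50_000):
    """Integrate f over (0, inf): substitute sigma = tan(omega), which maps
    omega in (0, pi/2) onto (0, inf), then apply the composite Simpson rule
    with weights 1, 4, 2, 4, ..., 2, 4, 1 on [0, pi/2]."""
    if n_intervals % 2:            # Simpson needs an even number of intervals
        n_intervals += 1
    omega = np.linspace(0.0, np.pi / 2, n_intervals + 1)
    g = np.empty_like(omega)
    g[:-1] = f(np.tan(omega[:-1])) / np.cos(omega[:-1]) ** 2   # Jacobian 1/cos^2
    g[-1] = 0.0                    # limit at omega = pi/2 for a decaying integrand
    w = np.ones_like(omega)
    w[1:-1:2] = 4.0
    w[2:-1:2] = 2.0
    h = omega[1] - omega[0]
    return h / 3.0 * np.sum(w * g)
```

Integrands with known values, such as $\int_0^\infty e^{-\sigma}\,d\sigma = 1$, provide a quick check that a given implementation of the scheme is correct.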
7.5 E: The simulation
Consider a random variable $\mathbf{Z} = [Z_1, \ldots, Z_d]^T$ that follows a multivariate normal distribution with mean vector $\boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$ and $(d \times d)$-dimensional covariance matrix $\boldsymbol{\Sigma}$, $\mathbf{Z} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Let $\mathbf{b}$ be a constant vector and $\mathbf{D}$ a known square matrix; then the transformation $\mathbf{T} = \mathbf{D}\mathbf{Z} + \mathbf{b}$ follows the multivariate normal distribution with mean vector $\mathbf{D}\boldsymbol{\mu} + \mathbf{b}$ and covariance matrix $\mathbf{D}\boldsymbol{\Sigma}\mathbf{D}^T$, that is $\mathbf{T} \sim N(\mathbf{D}\boldsymbol{\mu} + \mathbf{b}, \mathbf{D}\boldsymbol{\Sigma}\mathbf{D}^T)$ [Rizzo, 2008, p.70-72]. This is the information we need to simulate random samples of $\boldsymbol{\varepsilon}_{LS}$, $\boldsymbol{\varepsilon}_{RE}$ and $\boldsymbol{\lambda}_{RE}$.
Start with the random effects model, given that we know $\mu$ and $\sigma_{RE}$. Since the covariance matrix $\mathbf{V}$ is positive definite, we can use the Cholesky decomposition to write $\mathbf{V} = \mathbf{C}^T\mathbf{C}$. For $\boldsymbol{\varepsilon}_{RE} \sim N(\mathbf{0}, \mathbf{V})$, first generate iid standard normal observations, that is, $\mathbf{Z} = [Z_1, \ldots, Z_d]^T$ where $Z_1, \ldots, Z_d \overset{iid}{\sim} N(0, 1)$. This is the same as generating one observation $\mathbf{z}$ from $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$, since a diagonal matrix $\mathbf{I}$ corresponds to mutually independent variables $Z_1, \ldots, Z_d$. Then we obtain a simulated sample for $\boldsymbol{\varepsilon}_{RE}$ by the transformation $\boldsymbol{\varepsilon}_{RE} = \mathbf{C}^T\mathbf{Z}$, because then $\boldsymbol{\varepsilon}_{RE} \sim N(\mathbf{C}^T\mathbf{0} + \mathbf{0},\, \mathbf{C}^T\mathbf{I}(\mathbf{C}^T)^T) = N(\mathbf{0}, \mathbf{C}^T\mathbf{C}) = N(\mathbf{0}, \mathbf{V})$. We can similarly generate a random sample for $\boldsymbol{\lambda}_{RE} \sim N(\mathbf{0}, \sigma_{RE}^2\mathbf{I})$: first generate iid standard normal observations $\mathbf{W} \sim N(\mathbf{0}, \mathbf{I})$ and set $\boldsymbol{\lambda}_{RE} = \sigma_{RE}\mathbf{W}$, which corresponds to a generated sample from $\boldsymbol{\lambda}_{RE} \sim N(\mathbf{0}, \sigma_{RE}^2\mathbf{I})$. A simulated sample from the random effects model is then $\mu\mathbf{1} + \boldsymbol{\lambda}_{RE} + \boldsymbol{\varepsilon}_{RE}$.
Then we have the location-scale model, given that we know $\mu$ and $\sigma_{LS}$. A simulated sample for $\boldsymbol{\varepsilon}_{LS} \sim N(\mathbf{0}, \sigma_{LS}^2\mathbf{V})$ can be generated by first generating iid samples $Z_1, \ldots, Z_d \overset{iid}{\sim} N(0, 1)$, since this corresponds to an observation from $\mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$. Then we make the transformation $\boldsymbol{\varepsilon}_{LS} = \sigma_{LS}\mathbf{C}^T\mathbf{Z}$, because then $\boldsymbol{\varepsilon}_{LS} \sim N(\sigma_{LS}\mathbf{C}^T\mathbf{0},\, \sigma_{LS}^2\mathbf{C}^T\mathbf{I}(\mathbf{C}^T)^T) = N(\mathbf{0}, \sigma_{LS}^2\mathbf{V})$. A simulated sample from the location-scale model is then $\mu\mathbf{1} + \boldsymbol{\varepsilon}_{LS}$.
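The two recipes can be sketched as follows (Python with NumPy; function names are ours). Note that `np.linalg.cholesky` returns the lower-triangular factor $\mathbf{C}$ with $\mathbf{V} = \mathbf{C}\mathbf{C}^T$, i.e. the transpose of the upper-triangular factor returned by R's `chol`, so the roles of $\mathbf{C}$ and $\mathbf{C}^T$ are swapped relative to the text:

```python
import numpy as np

def simulate_lsm(mu, sigma_ls, V, rng):
    """One draw x = mu*1 + eps_LS with eps_LS ~ N(0, sigma_LS^2 V)."""
    n = V.shape[0]
    C = np.linalg.cholesky(V)      # lower triangular, V = C C^T
    return mu + sigma_ls * (C @ rng.standard_normal(n))

def simulate_rem(mu, sigma_re, V, rng):
    """One draw x = mu*1 + lambda_RE + eps_RE with
    eps_RE ~ N(0, V) and lambda_RE ~ N(0, sigma_RE^2 I)."""
    n = V.shape[0]
    C = np.linalg.cholesky(V)
    return mu + sigma_re * rng.standard_normal(n) + C @ rng.standard_normal(n)
```

Averaging the sample covariance of many such draws and comparing it with the target covariance matrix is a simple way to validate the transformations.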
7.6 F: Symbols and clarifications
X is a random variable.
x is a scalar, a realization of X.
If $\mathbf{A}$ is some matrix, then $\mathbf{A}^T$ is its transpose.
X = [X1, X2, . . . , Xn]T.
x = [x1, x2, . . . , xn]T.
1 = [1, 1, . . . , 1]T.
$\mathbf{I}$ is the identity matrix.
$$\mathbf{V} = \begin{pmatrix} u_1^2 & \rho_{12}u_1u_2 & \cdots & \rho_{1n}u_1u_n \\ \rho_{21}u_2u_1 & u_2^2 & \cdots & \rho_{2n}u_2u_n \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1}u_nu_1 & \rho_{n2}u_nu_2 & \cdots & u_n^2 \end{pmatrix} = \mathbf{U}^{1/2}\,\boldsymbol{\Omega}\,\mathbf{U}^{1/2}\,,$$
where $\mathbf{U}^{1/2} = \mathrm{diag}(u_1, u_2, \ldots, u_n)$ and $\boldsymbol{\Omega}$ is the correlation matrix with unit diagonal and off-diagonal elements $\rho_{ij}$.
$\boldsymbol{\varepsilon}_{LS}$ represents the heterogeneity in the location-scale model and is an $(n \times 1)$ column random vector.
$\boldsymbol{\varepsilon}_{RE}$ represents the variation in $\mathbf{X}$ dependent only on $\mathbf{V}$ and is an $(n \times 1)$ column random vector.
$\boldsymbol{\lambda}_{RE}$ represents the heterogeneity in the random effects model and is an $(n \times 1)$ column random vector.
$\mu$ represents the parameter of interest and is a scalar random variable.
$\sigma_{LS}$ and $\sigma_{RE}$ represent the unexplainable interlaboratory heterogeneity in the location-scale model and the random effects model, respectively. They are both scalars and are considered nuisance parameters.
$\theta$ is an arbitrary parameter, scalar or vector.
Θ is the parameter space of θ. π(θ) is a prior distribution for θ.
πN(θ) is a noninformative prior distribution for θ and we assume it to
be improper unless otherwise stated.
For example, a multivariate normal distribution $\mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\mathbf{X}$ has dimension $d$, has the probability density function
$$f(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\sqrt{\det(\boldsymbol{\Sigma})}}\exp\left(-\frac{(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}{2}\right).$$
The probability density function of $\boldsymbol{\varepsilon}_{LS} \sim N(\mathbf{0}, \sigma_{LS}^2\mathbf{V})$ is then
$$f(\boldsymbol{\varepsilon}_{LS}|\mathbf{0}, \sigma_{LS}^2\mathbf{V}) = \frac{1}{(2\pi)^{n/2}\,\sigma_{LS}^n\sqrt{\det(\mathbf{V})}}\exp\left(-\frac{\boldsymbol{\varepsilon}_{LS}^T\mathbf{V}^{-1}\boldsymbol{\varepsilon}_{LS}}{2\sigma_{LS}^2}\right).$$
7.7 G: Data on the Newtonian constant of gravitation

The source of the data on the gravitational constant G is [Merkatas et al., 2019], and we express the estimates and uncertainties of G as below:
Study          Estimate (G/10^-11 m^3 kg^-1 s^-2)   Uncertainty
NIST-82        6.67248                              0.00043
TR&D-96        6.6729                               0.00050
LANL-97        6.67398                              0.00070
UWash-00       6.674255                             0.000092
BIPM-01        6.67559                              0.00027
UWup-02        6.67422                              0.00098
MSL-03         6.67387                              0.00027
HUST-05        6.67222                              0.00087
UZur-06        6.67425                              0.00012
HUST-09        6.67349                              0.00018
JILA-10        6.67260                              0.00025
BIPM-14        6.67554                              0.00016
LENS-14        6.67191                              0.00099
UCI-14         6.67435                              0.00013
HUST-TOS-18    6.674184                             0.000078
HUST-AAF-18    6.674484                             0.000078
7.8 H: Program code in R
################################################################################
# Data + some definitions
################################################################################
X_gravity <- c(6.672480,6.672900,6.673980,6.674255,6.675590,6.674220,6.673870,
               6.672220,6.674250,6.673490,6.672600,6.675540,6.671910,6.674350,
               6.674184,6.674484)   ### observations
u <- c(4.3e-04,5.0e-04,7.0e-04,9.2e-05,2.7e-04,9.8e-04,2.7e-04,8.7e-04,
       1.2e-04,1.8e-04,2.5e-04,1.6e-04,9.9e-04,1.3e-04,7.8e-05,7.8e-05) ### uncertainties
n<-length(u)          ### number of observations
bone<-rep(1,n)        ### vector of 1:s
In<-diag(n)           ### identity matrix
Jn<-matrix(1,n,n)     ### matrix of 1:s = bone%*%t(bone)
V<-diag(n)            ### the correlation matrix. Now: diagonal
u2<-u^2
U<-diag(u)%*%V%*%diag(u)        ### the covariance matrix U^{1/2} Omega U^{1/2}
C_REM_n<-(2*pi)^(-0.5*(n-1))    ### constant of normalization, REM
C_LSM_n<-(pi)^(-0.5*(n-1))      ### constant of normalization, LSM
################# numerical integration, definition of nodes and weights
N<-100                ### number of nodes
h <- (pi/2)/(N-1)     ### interval length
y<-seq(0,pi/2,h)      # Nodes ### to be 'x-values' (= sigma_RE)
tan_y<-tan(y[2:N])    ### inserted into tan formula
cos_y2<-cos(y[2:N])^2 ### inserted into cos^2 formula
w <- rep(1,N)         ### Simpson weights 1,4,2,4,...,2,4,1
w[seq(2,N-1,2)] <- 4
w[seq(3,N-2,2)] <- 2
w <- w*h/3 # Weights
######################################################## training samples
l<-2                  ## two observations are enough for propriety
bone_l<-rep(1,l)      ### vector of 1:s for training sample
Il<-diag(l)           ### identity matrix for training sample
Jl<-matrix(1,l,l)     ### matrix of 1:s for training sample
set_l<-combn(1:n, l)  ### all possible combinations of sets with two elements
length_l<-ncol(set_l) ### number of combinations
C_REM_l<-(2*pi)^(-0.5*(l-1))  ### constant of normalization for training sample, REM
C_LSM_l<-(pi)^(-0.5*(l-1))    ### constant of normalization for training sample, LSM
################################################################################
# Simulation, Cholesky decomposition
################################################################################
## general constants
mu<-0                 ### value for mu
sigma_LS<-1.5         ### value for sigma_{LSM}
sigma_RE<-0.02        ### value for sigma_{REM}
cholU<-t(chol(U))     ### square root of U based on Cholesky decomposition;
                      ### t() is used since R returns the Cholesky root as an
                      ### upper triangular matrix

# Simulating a sample from the LSM #
X_sim_LS<- rep(mu,n)+sigma_LS*cholU%*%rnorm(n)

# Simulating a sample from the REM #
X_sim_RE<- rep(mu,n)+sigma_RE*rnorm(n)+cholU%*%rnorm(n)