Multiple comparison procedures based on marginal p-values


Martin Ekenstierna

U.U.D.M. Project Report 2004:12

Degree project in mathematical statistics, 20 credits. Supervisor: Olivier Guilbaud, AstraZeneca

Examiner: Silvelyn Zwanzig. June 2004

Department of Mathematics

Uppsala University


Abstract

When testing more than one hypothesis at the same time, the probability of making at least one type I error increases. In this paper some methods are presented that control that probability to be at most the desired overall level α, no matter how many (or which) of the hypotheses are false. This paper mainly deals with methods based on marginal p-values, which are of considerable practical importance. Sample size calculations are illustrated for two of the methods, the Bonferroni procedure and the Bonferroni-Holm procedure.


Acknowledgment

I would like to thank Olivier Guilbaud at AstraZeneca for his engagement and for all the advice and help he gave me.

1. INTRODUCTION
2. BASIC CONCEPTS AND TERMINOLOGY FOR MCPS
2.1 Family of (null-) hypotheses $H_i$ considered
2.2 P-values
2.3 Familywise error rate (FWER) for type I errors – weak and strong control
2.4 Power
3. MCPS VALID QUITE GENERALLY
3.1 The Bonferroni procedure
3.2 Bonferroni-Holm’s procedure
3.3 Shaffer’s improvement of Bonferroni-Holm
3.4 Fixed sequence procedure
3.5 Wiens’ generalization of the fixed sequence procedure
3.6 Fixed sequence procedure for groups of hypotheses
4. A GENERAL MCP: THE CLOSED TEST PROCEDURE
5. MCPS VALID UNDER CERTAIN INDEPENDENCE/DEPENDENCE ASSUMPTIONS
5.1 Hochberg’s procedure
5.2 Hommel’s procedure
6. POWER AND SAMPLE SIZE
6.1 Illustration 1: Bonferroni and Power concept (3)
6.2 Illustration 2: Bonferroni-Holm and Power concept (3)
APPENDIX 1: Proof of strong control for the fixed sequence procedure (FS)
APPENDIX 2
APPENDIX 3
REFERENCES

1. Introduction

In clinical studies various multiplicity problems occur. For example, there may be several comparisons of interest. Such comparisons may concern several response variables, more than two treatments, and/or several subgroups of subjects.

When testing a single hypothesis, a type I error is made if the hypothesis is rejected although it is actually true. The probability of making such an error is often controlled to be at most a certain level α. If several hypotheses are tested, a type I error can be made for each hypothesis. The probability of making at least one type I error then increases, often sharply, with the number of hypotheses. That is, the chance of erroneously rejecting at least one true hypothesis grows with every additional test. The methods described in this text ensure that the probability of making at least one type I error is at most a given overall level α.
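To give a feeling for how sharply this probability can grow, the following minimal sketch computes it under simplifying assumptions that are not made in the text: all n hypotheses are true, the n tests are independent, and each test has level exactly α. Under these assumptions the probability of at least one type I error is 1 − (1 − α)^n. The function name `fwer_independent` is only illustrative.

```python
def fwer_independent(alpha: float, n: int) -> float:
    """Probability of at least one type I error among n independent
    tests, each at level exactly alpha, when all n hypotheses are true."""
    return 1.0 - (1.0 - alpha) ** n


if __name__ == "__main__":
    # Uncorrected testing at alpha = 0.05 for a growing number of hypotheses.
    for n in (1, 2, 5, 10, 20):
        print(n, round(fwer_independent(0.05, n), 4))
```

With dependent tests the exact value differs, but the qualitative message is the same: without a multiplicity correction the overall error probability can far exceed α.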

In some situations there may be some variables and/or comparisons that are more important than others. Some of the methods described in this text take this aspect into account.

This paper will mainly deal with methods based on marginal p-values associated with the hypotheses of interest. These methods are of considerable practical importance because of their simplicity, flexibility, and general applicability.

General references about multiple comparison procedures (MCPs) are Hochberg & Tamhane (1987), Westfall & Young (1993), and Hsu (1996).

In section 2, some basic concepts and terminology are introduced. Section 3 deals with some MCPs that are valid quite generally, without any assumption about independence or dependence between the tests. Section 4 deals with a very general MCP that is also valid quite generally: the closed test procedure. Section 5 deals with two MCPs that require independent tests (or at least a certain kind of dependence). Finally, in section 6, some power and sample size calculations are illustrated that are based on simple approaches which are not computer intensive.


2. Basic concepts and terminology for multiple comparison procedures (MCPs)

2.1 Family of (null-) hypotheses $H_i$ considered

The family of (null-) hypotheses $H_1, H_2, \ldots, H_n$ considered in this context consists of the set of hypotheses of interest that you want to make significance statements about while jointly controlling errors. As when testing a single hypothesis, the (null-) hypotheses should be the opposite of what you want to show. The null hypotheses may concern distinct variables, parameters, subgroups, etc.

Let k be the number of true hypotheses in this family in a given situation. Of course this number is unknown, but it is nevertheless sometimes possible to say something about it. Consider the following examples, where the means $\mu$ can be thought of as true mean responses in a parallel group study.

Example 2.1:

$$H_1: \mu_A = \mu_B, \qquad H_2: \mu_A = \mu_C, \qquad H_3: \mu_B = \mu_C$$

This is a family with n = 3 null hypotheses. The number k of true hypotheses can be 0, 1 or 3; but not 2, because, for example, if $H_1$ and $H_2$ are true then $H_3$ must also be true.

Example 2.2:

$$H_1: \mu_A \ge \mu_B, \qquad H_2: \mu_A \ge \mu_C, \qquad H_3: \mu_B \ge \mu_C$$

In this example the number k of true hypotheses can be 0, 1, 2 or 3.

Example 2.3:

$$H_1: \mu_A - \mu_B \le 0, \qquad H_2: \mu_A - \mu_B \ge 0$$

In this example the number k of true hypotheses can be 1 or 2; but not 0, because both cannot be false.

Example 2.4:

$$H_1: \mu_A = \mu_{Placebo}, \qquad H_2: \mu_B = \mu_{Placebo}, \qquad H_3: \mu_C = \mu_{Placebo}$$

In this example the number k of true hypotheses can be 0, 1, 2 or 3.

2.2 P-values

Briefly, a p-value $p_i$ is the probability under a null hypothesis $H_i$ of observing a test statistic for $H_i$ that is as extreme as, or more extreme than, the observed value in the direction of rejection.

Such a p-value $p_i$ gives more information than simply a reject or accept decision about $H_i$ at some level $\alpha$, and it can be seen as the level at which $H_i$ would just barely be rejected. The p-value $p_i$ is a random variable (since it depends on the outcome of the test statistic), and it satisfies

$$\Pr[\, p_i \le u \mid H_i \text{ is true} \,] \le u$$

for any given $0 < u < 1$. With test statistics that have a continuous distribution, equality holds in the last inequality $\le$.

2.3 Familywise error rate (FWER) for type I errors – weak and strong control

When testing a single hypothesis $H_i$, a type I error is made if $H_i$ is rejected although $H_i$ actually is true. The probability of making such a type I error with a test of $H_i$ is usually controlled to be $\le$ a certain level $\alpha$, typically equal to 0.05.

When there are several hypotheses, $H_1, H_2, \ldots, H_n$, you also want to control the type I error at some level $\alpha$. A type I error is then made if one or more of the true hypotheses in the family is rejected. The methods to be discussed control the familywise error rate (FWER), which is defined as the probability of making at least one type I error in the family. This control can either be weak or strong. Weak control means that the FWER is $\le \alpha$ given that all null hypotheses in the family are true. Strong control means that the FWER is $\le \alpha$ no matter how many or which null hypotheses in the family are true. Weak control is usually not satisfactory in practice, because it is often the case that some hypotheses are true, whereas others are false.

2.4 Power

When testing a single null hypothesis $H_i$ at level $\alpha$ through its p-value $p_i$, the power is the probability $\Pr[p_i \le \alpha]$ of rejecting the null hypothesis $H_i$ when $H_i$ actually is false. The power is equal to $1 - \beta$, where $\beta$ equals the probability of accepting $H_i$ although $H_i$ is false (type II error).

One way to improve the power is often to increase sample sizes, and it is then possible to calculate the sample size that is needed to get a specific level of power. This is often rather straightforward with a single hypothesis $H_i$, but less so with multiple comparisons.

When there are several hypotheses, $H_1, H_2, \ldots, H_n$, in the family of interest, there are three different definitions of power that are common:

(1) The probability of rejecting at least one false hypothesis
(2) The average probability of rejecting the false hypotheses
(3) The probability of rejecting all false hypotheses

Definition (3) is ideally the best one, even though it may require a large sample to achieve the desired level of power. Definition (1) is compatible with multiple comparison methods that control the FWER, but in the sample size calculations illustrated in this paper, definition (3) will mainly be considered.

3. MCPs valid quite generally

3.1 The Bonferroni procedure


The classical Bonferroni procedure is very simple:

Reject all hypotheses $H_i$ in the family $H_1, H_2, \ldots, H_n$ that have a p-value $p_i \le \alpha/n$, where n is the number of hypotheses $H_i$.

This is the unweighted version of Bonferroni.

There is also a weighted version where $p_i$ is compared to $\alpha_i$, where the $\alpha_i$’s satisfy $\sum_{i=1}^{n} \alpha_i = \alpha$. The unweighted version just mentioned corresponds to the special case with each $\alpha_i = \alpha/n$.

The control of the FWER is strong in that the FWER is $\le \alpha$ no matter how many or which $H_i$’s in the family are true; see section 2.3. However, the problem with the Bonferroni procedure is that it is rather conservative: it may fail to reject many false $H_i$’s, since it requires very small p-values when the number n of hypotheses in the family is large.
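The unweighted and weighted Bonferroni rules can be sketched in a few lines. This is only an illustration: the function name `bonferroni` and the `weights` argument (relative weights that are rescaled so that the individual levels sum to α) are assumptions of this sketch, and the p-values in the usage example are made up.

```python
from typing import List, Optional, Sequence


def bonferroni(p_values: Sequence[float], alpha: float = 0.05,
               weights: Optional[Sequence[float]] = None) -> List[bool]:
    """Return one reject decision per hypothesis.

    Without weights every p-value is compared with alpha / n.
    With weights, hypothesis i is tested at alpha_i = alpha * w_i / sum(w),
    so that the alpha_i sum to alpha.
    """
    n = len(p_values)
    if weights is None:
        levels = [alpha / n] * n
    else:
        total = sum(weights)
        levels = [alpha * w / total for w in weights]
    return [p <= level for p, level in zip(p_values, levels)]


if __name__ == "__main__":
    print(bonferroni([0.01, 0.02, 0.04]))                # unweighted, alpha/3 each
    print(bonferroni([0.02, 0.02, 0.01], 0.05, [2, 1, 1]))  # weighted levels
```

In the weighted call the first hypothesis is tested at level 0.025 and the other two at level 0.0125 each.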

3.2 Bonferroni-Holm’s procedure

This stepwise (step-down) procedure is based on ordered p-values, and the corresponding hypotheses are rejected one at a time. It is a step-down procedure in that one starts with the most extreme (that is, the smallest) p-value, and continues with less extreme p-values, in the successive rejection decisions. This procedure was proposed by Holm (1979). Let $p_{(1)} < p_{(2)} < \cdots < p_{(n)}$ be the ordered p-values and $H_{(1)}, H_{(2)}, \ldots, H_{(n)}$ be the corresponding null-hypotheses.

Step 1: Look at the smallest p-value $p_{(1)}$.
If it is $\le \alpha/n$, then reject $H_{(1)}$ and go on to step 2.
If it is $> \alpha/n$, then accept $H_{(1)}, H_{(2)}, \ldots, H_{(n)}$ and stop.

…

Step j: Look at the j:th smallest p-value $p_{(j)}$.
If it is $\le \alpha/(n-j+1)$, then reject $H_{(j)}$ and go on to step j+1.
If it is $> \alpha/(n-j+1)$, then accept $H_{(j)}, \ldots, H_{(n)}$ and stop.

…

Step n: Look at the largest p-value $p_{(n)}$.
If it is $\le \alpha$, then reject $H_{(n)}$ and stop.
If it is $> \alpha$, then accept $H_{(n)}$ and stop.

The Bonferroni-Holm procedure also controls the FWER in the strong sense. It is less conservative than the classical Bonferroni procedure: it rejects every hypothesis that Bonferroni rejects, and possibly more. It can therefore always be used instead of the classical Bonferroni procedure (for multiple tests based on individual p-values).
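The step-down scheme above can be sketched as follows. The function name `holm` and the example p-values are illustrative assumptions; ties between p-values are broken by the original order, which the text does not discuss.

```python
from typing import List, Sequence


def holm(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Bonferroni-Holm step-down procedure; returns reject decisions
    in the original (unordered) order of the hypotheses."""
    n = len(p_values)
    # Indices of the hypotheses sorted by increasing p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for step, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (n - step + 1):
            reject[idx] = True   # reject H_(step), go on to the next step
        else:
            break                # accept H_(step), ..., H_(n) and stop
    return reject


if __name__ == "__main__":
    print(holm([0.01, 0.02, 0.04]))
    print(holm([0.03, 0.01, 0.04]))
```

For the first set of p-values Holm rejects all three hypotheses, whereas the classical Bonferroni procedure (every p-value compared with 0.05/3) would reject only the first.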

3.3 Shaffer’s improvement of Bonferroni-Holm

There is a possibility to improve the Bonferroni-Holm procedure (so it rejects more) when there are logical relations between the hypotheses $H_1, H_2, \ldots, H_n$ in the family of interest. Shaffer (1986) proposed that improvement. Suppose it is known that $k \in K \subset \{0, 1, \ldots, n\}$, where, as before, k is the number of true hypotheses in the family. For example, note that it is known that $K = \{0, 1, 3\}$ in example 2.1 in section 2.1, and $K = \{1, 2\}$ in example 2.3. In the Bonferroni-Holm procedure the denominator in the j:th step that divides $\alpha$ is $n - j + 1$, whereas in Shaffer’s improvement a denominator $t_j$ is used that is never larger. This denominator $t_j$ is defined as:

$$t_j = \max\{\, r \in K : r \le n - j + 1 \,\} \quad \text{for } j = 1, \ldots, n.$$

It follows from this definition that $t_j \le n - j + 1$, so more hypotheses can be rejected.

This may seem a bit cryptic but consider the family of three hypotheses in example 2.1 in section 2.1:

$$H_1: \mu_A = \mu_B, \qquad H_2: \mu_A = \mu_C, \qquad H_3: \mu_B = \mu_C$$

Here it is known that the number k of true hypotheses in the family satisfies $k \in K \equiv \{0, 1, 3\}$, and the denominators $t_j$ are thus given by:

$$t_1 = \max\{ r \in K : r \le 3 \} = 3$$
$$t_2 = \max\{ r \in K : r \le 2 \} = 1$$
$$t_3 = \max\{ r \in K : r \le 1 \} = 1.$$

So, in step one $p_{(1)}$ would be compared with $\alpha/3$, in step two $p_{(2)}$ with $\alpha$, and in step three $p_{(3)}$ also with $\alpha$. There is no reason to protect against error in step two in the case that two hypotheses are true and one is false, since that combination of true and false hypotheses cannot happen.

The type-I FWER is also strongly controlled to be $\le \alpha$ in Shaffer’s improvement of the Bonferroni-Holm procedure. Shaffer (1986) also proposed some other improvements.
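The denominators $t_j$ and the resulting step-down procedure can be sketched as below. The names `shaffer_denominators` and `shaffer`, as well as the example p-values, are assumptions of this sketch. The value r = 0 is skipped because a denominator of zero is not usable, and the sketch simply raises an error if no positive r is available at some step, a case the text does not address.

```python
from typing import Iterable, List, Sequence


def shaffer_denominators(n: int, possible_k: Iterable[int]) -> List[int]:
    """t_j = max{r in K : r <= n - j + 1} for j = 1, ..., n (r >= 1 only)."""
    positive = sorted(r for r in set(possible_k) if r >= 1)
    denominators = []
    for j in range(1, n + 1):
        candidates = [r for r in positive if r <= n - j + 1]
        if not candidates:
            # Assumption of this sketch: refuse rather than divide by zero.
            raise ValueError(f"no positive r in K with r <= {n - j + 1}")
        denominators.append(max(candidates))
    return denominators


def shaffer(p_values: Sequence[float], possible_k: Iterable[int],
            alpha: float = 0.05) -> List[bool]:
    """Bonferroni-Holm with Shaffer's denominators t_j instead of n - j + 1."""
    n = len(p_values)
    t = shaffer_denominators(n, possible_k)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for step, idx in enumerate(order):
        if p_values[idx] <= alpha / t[step]:
            reject[idx] = True
        else:
            break
    return reject


if __name__ == "__main__":
    print(shaffer_denominators(3, {0, 1, 3}))       # example 2.1
    print(shaffer([0.015, 0.04, 0.03], {0, 1, 3}))  # hypothetical p-values
```

With these hypothetical p-values the plain Bonferroni-Holm procedure would stop at the second step (0.03 > 0.05/2), whereas Shaffer’s version compares the second and third ordered p-values with α itself.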

3.4 Fixed sequence procedure

The fixed sequence procedure is also a stepwise procedure. This is an important simple procedure that is much used in practice. It is assumed that the hypotheses are ordered in a sequence, by interest and/or anticipated high initial rejection probabilities. Let the order $H_1, H_2, \ldots, H_n$ be fixed (and prespecified). Then, successively test each hypothesis at level $\alpha$ until the first non-significant hypothesis occurs:

Step 1: Look at the p-value $p_1$ corresponding to $H_1$.
If it is $\le \alpha$, reject $H_1$ and go on to step 2.
If it is $> \alpha$, accept $H_1, H_2, \ldots, H_n$ and stop.

Step 2: Look at the p-value $p_2$ corresponding to $H_2$.
If it is $\le \alpha$, reject $H_2$ and go on to step 3.
If it is $> \alpha$, accept $H_2, \ldots, H_n$ and stop.

…

Step n: Look at the p-value $p_n$ corresponding to $H_n$.
If it is $\le \alpha$, reject $H_n$ and stop.
If it is $> \alpha$, accept $H_n$ and stop.


Note that unordered p-values are used in the steps, in contrast to the Bonferroni-Holm and Shaffer procedures.

The procedure controls the FWER in the strong sense (proof in appendix 1). A nice aspect of this procedure is that all p-values are compared to α , so it seems as if no multiplicity correction is done.

The problem is however that you may stop rejecting “too early”, as you stop when you come to a non-significant p-value, even if subsequent p-values may be very small.
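The following sketch implements the fixed sequence procedure and shows the "stopping too early" effect with made-up p-values. The function name `fixed_sequence` is illustrative, and a p-value exactly equal to α is treated as significant, which is an assumption of this sketch.

```python
from typing import List, Sequence


def fixed_sequence(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Test H_1, H_2, ... in their prespecified order, each at level alpha,
    and stop at the first non-significant p-value."""
    reject = [False] * len(p_values)
    for i, p in enumerate(p_values):
        if p <= alpha:
            reject[i] = True
        else:
            break  # all remaining hypotheses are accepted
    return reject


if __name__ == "__main__":
    # The third p-value is tiny, but H_3 is never tested because H_2 fails.
    print(fixed_sequence([0.01, 0.20, 0.001]))
```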

3.5 Wiens’ generalization of the fixed sequence procedure

Wiens (2003) proposed a generalization of the fixed sequence procedure with which it is possible to continue even if a non-significant p-value is encountered. As in the fixed sequence procedure, let the order $H_1, H_2, \ldots, H_n$ be fixed (and prespecified) and assign $\alpha_i'$ to each hypothesis $H_i$ such that $\sum_i \alpha_i' = \alpha$. The first hypothesis $H_1$ is tested at level $\alpha_1 = \alpha_1'$. The second hypothesis $H_2$ is tested at level $\alpha_2 = \alpha_1 + \alpha_2'$ if $H_1$ was rejected, otherwise at level $\alpha_2 = \alpha_2'$. Subsequent $H_i$ is tested at level $\alpha_i = \alpha_{i-1} + \alpha_i'$ if $H_{i-1}$ was rejected, otherwise at level $\alpha_i = \alpha_i'$. Thus the $\alpha_i'$’s are accumulated as long as the previous hypotheses are rejected.

An example for n = 7 $H_i$’s is as follows:

| Test | A priori $\alpha_i'$ | Test to be performed at level | Assumed outcome of test |
|---|---|---|---|
| 1 | $\alpha_1'$ | $\alpha_1 = \alpha_1'$ | Sign. |
| 2 | $\alpha_2'$ | $\alpha_2 = \alpha_1 + \alpha_2'$ | Non-sign. |
| 3 | $\alpha_3'$ | $\alpha_3 = \alpha_3'$ | Sign. |
| 4 | $\alpha_4'$ | $\alpha_4 = \alpha_3 + \alpha_4'$ | Sign. |
| 5 | $\alpha_5'$ | $\alpha_5 = \alpha_4 + \alpha_5'$ | Sign. |
| 6 | $\alpha_6'$ | $\alpha_6 = \alpha_5 + \alpha_6'$ | Non-sign. |
| 7 | $\alpha_7'$ | $\alpha_7 = \alpha_7'$ | |

The ordinary fixed sequence procedure described in section 3.4 is a special case with $\alpha_1' = \alpha$ and all other $\alpha_i'$ set to zero. Wiens’ generalization controls the FWER in the strong sense.
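The accumulation rule can be sketched as follows, based only on the description given here. The function name `wiens`, the chosen a priori levels and the p-values are illustrative assumptions.

```python
from typing import List, Sequence


def wiens(p_values: Sequence[float], alpha_primes: Sequence[float]) -> List[bool]:
    """Fixed sequence testing with a priori levels alpha'_i (summing to alpha).

    H_i is tested at alpha_i = alpha_{i-1} + alpha'_i if H_{i-1} was rejected,
    otherwise at alpha_i = alpha'_i.
    """
    reject = []
    carried = 0.0  # level passed on from the previous hypothesis, if rejected
    for p, a_prime in zip(p_values, alpha_primes):
        level = carried + a_prime
        rejected = p <= level
        reject.append(rejected)
        carried = level if rejected else 0.0
    return reject


if __name__ == "__main__":
    # A priori levels 0.03 + 0.01 + 0.01 = 0.05; testing continues after H_2 fails.
    print(wiens([0.02, 0.05, 0.005], [0.03, 0.01, 0.01]))
    # Special case alpha'_1 = alpha: the ordinary fixed sequence procedure.
    print(wiens([0.01, 0.20, 0.001], [0.05, 0.0, 0.0]))
```

In the first call $H_2$ is tested at level 0.04 and is not rejected, so $H_3$ falls back to its own a priori level 0.01 and is still rejected.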

3.6 Fixed sequence procedure for groups of hypotheses


Another very useful generalization of the fixed sequence procedure is when groups of hypotheses are considered. Consider the following M > 1 groups $G_1, \ldots, G_M$ of hypotheses,

$$G_1 = \{ H_{1,1}, H_{1,2}, \ldots, H_{1,n_1} \}$$
$$G_2 = \{ H_{2,1}, H_{2,2}, \ldots, H_{2,n_2} \}$$
$$\vdots$$
$$G_M = \{ H_{M,1}, H_{M,2}, \ldots, H_{M,n_M} \}.$$

Each group $G_m$ can be viewed as a small family of $n_m$ hypotheses. It is assumed here that the $n_m$ hypotheses in $G_m$ are tested through some pre-specified MCP that strongly controls its FWER to be $\le \alpha$. Different such MCPs may be used for different groups $G_m$.

The fixed sequence approach can then be applied to the successive groups (in their a priori order) as follows:

Step 1: Reject hypotheses $H_{1,i}$ in $G_1$ with its MCP. If all hypotheses in $G_1$ are rejected, then move on to the next step, otherwise stop.

…

Step m: Reject hypotheses $H_{m,i}$ in $G_m$ with its MCP. If all hypotheses in $G_m$ are rejected, then move on to the next step, otherwise stop; and so on.

Note again that any MCP that strongly controls its FWER to be $\le \alpha$ can be used, and that different MCPs can be used in the different groups $G_m$; for example, the Bonferroni procedure, the Bonferroni-Holm procedure, or the fixed sequence procedure. The fixed (and a priori specified) order of the groups $G_m$ may be by interest and/or anticipated high initial rejection probabilities.

This generalization strongly controls the FWER to be $\le \alpha$. The fixed sequence procedure in 3.4 is a special case with $n_m = 1$ for each m.
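A minimal sketch of the group-wise scheme is given below. For brevity it applies one and the same within-group MCP (here Bonferroni-Holm) to every group, although the text allows a different MCP per group; the names `holm` and `grouped_fixed_sequence` and the p-values are assumptions of this sketch.

```python
from typing import Callable, List, Sequence


def holm(p_values: Sequence[float], alpha: float) -> List[bool]:
    """Bonferroni-Holm step-down procedure, used as the within-group MCP."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for step, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (n - step + 1):
            reject[idx] = True
        else:
            break
    return reject


def grouped_fixed_sequence(
    groups: Sequence[Sequence[float]],
    alpha: float = 0.05,
    mcp: Callable[[Sequence[float], float], List[bool]] = holm,
) -> List[List[bool]]:
    """Test the groups in their a priori order, each with an MCP at level alpha.
    Move on to the next group only if every hypothesis so far was rejected."""
    results = []
    proceed = True
    for p_values in groups:
        decisions = mcp(p_values, alpha) if proceed else [False] * len(p_values)
        results.append(decisions)
        proceed = proceed and all(decisions)
    return results


if __name__ == "__main__":
    print(grouped_fixed_sequence([[0.01, 0.02], [0.03, 0.20], [0.001]]))
```

Here the first group is rejected completely, the second group is not, and the third group is therefore never tested despite its small p-value.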

4. A general MCP: the closed test procedure

The closed test procedure (Marcus, Peritz & Gabriel, 1976) consists of three parts:


Part 1: Extension of the family.

Let $H_1, H_2, \ldots, H_n$ be the basic family of null hypotheses of interest and extend this family with the intersection hypotheses for all non-empty subsets,

$$H_I = \bigcap_{i \in I} H_i, \qquad I \subset \{1, \ldots, n\}.$$

This new extended family is closed under intersections: no new hypotheses can be formed by further intersections. An intersection hypothesis $H_I$ is by definition considered to be true if all its component basic hypotheses $H_i$, $i \in I$, are true; otherwise $H_I$ is considered to be false.

The situation with 4 basic hypotheses $H_1, H_2, H_3, H_4$ can be illustrated as follows:

$H_{1234}$

$H_{123}$ $H_{124}$ $H_{134}$ $H_{234}$

$H_{12}$ $H_{13}$ $H_{14}$ $H_{23}$ $H_{24}$ $H_{34}$

$H_1$ $H_2$ $H_3$ $H_4$

Here the original basic hypotheses $H_1, H_2, H_3, H_4$ are placed at the bottom, whereas the intersection $H_I$ with $I = \{1, 2, 3, 4\}$ of all these hypotheses is placed at the top.

Part 2: Specification of marginal $\alpha$-level tests for the intersection hypotheses.

A marginal $\alpha$-level test is associated with each hypothesis $H_I$. Such a test for $H_I$ is thus such that it rejects $H_I$ with marginal probability $\le \alpha$ if $H_I$ is true, that is, if all the component basic hypotheses $H_i$, $i \in I$, are true.

An example of a possible test for $H_I$ is a simple Bonferroni test: $H_I$ is rejected if at least one of the component basic hypotheses $H_i$, $i \in I$, has a p-value $p_i \le \alpha / n_I$. Here $n_I$ denotes the number of elements in I, that is, the number of basic hypotheses in the intersection $H_I$. Any other test of $H_I$ that has level $\alpha$ can however be used. A test proposed by Simes (1986) is considered in this context in section 5.

Part 3: Rejection procedure.

By definition of the closed test procedure, $H_I$ is rejected if and only if each $H_J$ with $I \subset J \subset \{1, 2, \ldots, n\}$ is rejected by its marginal $\alpha$-level test. This test procedure strongly controls the FWER to be $\le \alpha$.

An example is clarifying. Consider the situation with 4 hypotheses $H_1, H_2, H_3, H_4$ and let $H_I^*$ indicate an observed significant marginal $\alpha$-level test:

$H_{1234}^*$

$H_{123}^*$ $H_{124}^*$ $H_{134}^*$ $H_{234}$

$H_{12}^*$ $H_{13}^*$ $H_{14}^*$ $H_{23}$ $H_{24}$ $H_{34}$

$H_1^*$ $H_2^*$ $H_3$ $H_4$

In this example only $H_1$ in the basic family $H_1, H_2, H_3, H_4$ of hypotheses is rejected by the closed test procedure. Even though $H_2$ is found to be significant by its marginal $\alpha$-level test, it is not rejected by the closed test procedure. This is because not all intersections $H_J$ with $\{2\} \subset J \subset \{1, 2, 3, 4\}$ are found to be significant with their marginal $\alpha$-level tests. For example, $H_{23}$ is not rejected by its marginal $\alpha$-level test.
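The three parts can be sketched by brute force for a small family. The sketch below uses the simple Bonferroni test of Part 2 as the marginal test for every intersection; the names `bonferroni_local_test` and `closed_test` and the p-values are assumptions. Since all $2^n - 1$ intersections are enumerated, this is only practical for small n.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def bonferroni_local_test(p_subset: Sequence[float], alpha: float) -> bool:
    """Reject H_I if at least one component p-value is <= alpha / n_I."""
    return min(p_subset) <= alpha / len(p_subset)


def closed_test(
    p_values: Sequence[float],
    alpha: float = 0.05,
    local_test: Callable[[Sequence[float], float], bool] = bonferroni_local_test,
) -> List[bool]:
    """Reject basic hypothesis H_i iff every intersection H_J with i in J
    is rejected by its marginal alpha-level test."""
    n = len(p_values)
    significant = {}
    # Parts 1 and 2: form all non-empty intersections and test each marginally.
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            significant[subset] = local_test([p_values[i] for i in subset], alpha)
    # Part 3: a basic hypothesis needs all intersections containing it.
    return [
        all(sig for subset, sig in significant.items() if i in subset)
        for i in range(n)
    ]


if __name__ == "__main__":
    print(closed_test([0.03, 0.01, 0.04]))
```

With Bonferroni tests for the intersections, the closed test procedure is known to give the same decisions as the Bonferroni-Holm procedure of section 3.2, which offers a convenient check of the sketch.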

5. MCPs valid under certain independence/dependence assumptions

The procedures described in this section are more powerful than those described in section 3, but they require that the p-values $p_1, p_2, \ldots, p_n$ corresponding to the family $H_1, H_2, \ldots, H_n$ of interest are independent or positively dependent in a certain sense (Sarkar & Chang 1997, Sarkar 1998). An example of when p-values are positively dependent in this sense is when several treatments are compared to one treatment (for example a placebo) with a Dunnett-type test. An example of dependent p-values that are not positively dependent in this sense is when all treatments are compared pairwise with a Tukey-type test. It may nevertheless be possible to use the procedures described in this section even if the p-values are neither independent nor positively dependent; a simulation study then needs to be carried out to ensure that a FWER $\le \alpha$ is kept.

Here two MCPs will be considered, one proposed by Hochberg (1988) and one proposed by Hommel (1988). Hommel’s and Hochberg’s procedures are based on the closed test procedure and some work by Simes (1986).

Simes’s test of an intersection hypothesis $H_0$. Suppose $H_0$ is the intersection $H_0 = \bigcap \{ H_i : i = 1, \ldots, n \}$ of all hypotheses $H_i$ in the family considered. This $H_0$ can be tested by using a Bonferroni approach, which means that $H_0$ is rejected if $p_{(1)} \le \alpha/n$, where as before $p_{(1)}$ is the smallest of the p-values $p_1, p_2, \ldots, p_n$. This is a rather conservative test. Simes proposed another test of this $H_0$: reject $H_0$ if $p_{(j)} \le j\alpha/n$ for any $j = 1, \ldots, n$. Simes showed that this test of $H_0$ has level $\alpha$ if the p-values $p_1, p_2, \ldots, p_n$ are independent under $H_0$.

Simes’s test of an intersection hypothesis is at least as powerful as Bonferroni’s, since the case j = 1 of Simes’s criterion is exactly the Bonferroni criterion. Simes did not show how to move on and make statements about the individual component hypotheses.

Hommel (1988) and Hochberg (1988) used the closed test procedure to extend Simes’s test for this purpose. Briefly, the idea is to use Simes’s test as the marginal $\alpha$-level test for each intersection hypothesis in the closed test procedure. Details about how Hochberg’s and Hommel’s procedures can be carried out are given below.
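Simes’s test of the overall intersection hypothesis, next to the Bonferroni test, can be sketched as follows. The function names are illustrative; the p-values are the ones used in the example of section 5.2.

```python
from typing import Sequence


def bonferroni_global_test(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Reject H_0 if the smallest p-value is <= alpha / n."""
    return min(p_values) <= alpha / len(p_values)


def simes_test(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Reject H_0 if p_(j) <= j * alpha / n for at least one j = 1, ..., n."""
    n = len(p_values)
    ordered = sorted(p_values)
    return any(p <= j * alpha / n for j, p in enumerate(ordered, start=1))


if __name__ == "__main__":
    p = [0.024, 0.030, 0.073]
    print(bonferroni_global_test(p), simes_test(p))
```

For these p-values the Bonferroni test does not reject $H_0$ (0.024 > 0.05/3), whereas Simes’s test does, because the second smallest p-value satisfies 0.030 ≤ 2·0.05/3.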

5.1 Hochberg’s procedure

Hochberg’s procedure has similarities with Bonferroni-Holm and can be described as a step-up version of it. As in Bonferroni-Holm’s procedure in section 3.2, let $p_{(1)} < p_{(2)} < \cdots < p_{(n)}$ denote the ordered p-values, and let $H_{(1)}, H_{(2)}, \ldots, H_{(n)}$ denote the corresponding null-hypotheses. Instead of beginning with the smallest p-value $p_{(1)}$, you start with the largest p-value $p_{(n)}$, and move up in the Bonferroni-Holm structure. Instead of rejecting one hypothesis at a time, all hypotheses with a smaller p-value than the rejected one are also rejected. Thus Hochberg’s procedure follows the following scheme:

Step 1: Look at the largest p-value $p_{(n)}$.
If it is $\le \alpha$, reject $H_{(1)}, H_{(2)}, \ldots, H_{(n)}$ and stop.
If it is $> \alpha$, accept $H_{(n)}$ and go on to step 2.

…

Step j: Look at the j:th largest p-value $p_{(n-j+1)}$.
If it is $\le \alpha/j$, reject $H_{(1)}, \ldots, H_{(n-j+1)}$ and stop.
If it is $> \alpha/j$, accept $H_{(n-j+1)}$ and go on to step j+1.

…

Step n: Look at the smallest p-value $p_{(1)}$.
If it is $\le \alpha/n$, reject $H_{(1)}$ and stop.
If it is $> \alpha/n$, accept $H_{(1)}$ and stop.

Hochberg’s procedure strongly controls the FWER to be $\le \alpha$, and is more powerful and somewhat simpler than Bonferroni-Holm. Briefly, Hochberg’s procedure can be shown to be a simplified version of the closed test procedure with Simes’s test used for each intersection hypothesis $H_I$. See Hochberg (1988) for details.
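A minimal sketch of the step-up scheme is given below. It is written in terms of the rank m of the ordered p-value $p_{(m)}$, which is compared with α/(n − m + 1), the same critical values as in Bonferroni-Holm but visited from the largest p-value downwards. The function name `hochberg` and the second set of p-values are assumptions of this sketch.

```python
from typing import List, Sequence


def hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Hochberg's step-up procedure; decisions are returned in the
    original order of the hypotheses."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    # Start with the largest p-value (rank n) and move towards the smallest.
    for rank in range(n, 0, -1):
        if p_values[order[rank - 1]] <= alpha / (n - rank + 1):
            # Reject H_(1), ..., H_(rank) and stop.
            for idx in order[:rank]:
                reject[idx] = True
            break
    return reject


if __name__ == "__main__":
    print(hochberg([0.024, 0.030, 0.073]))  # p-values of the example in 5.2
    print(hochberg([0.04, 0.045, 0.03]))
```

For the second set of p-values the largest p-value is already ≤ α, so all three hypotheses are rejected, whereas Bonferroni-Holm would stop at its first step because 0.03 > 0.05/3.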

5.2 Hommel’s procedure

Hommel’s procedure is somewhat more powerful than Hochberg but is more difficult to understand and carry out. Actually, it is equivalent to the closed test procedure with Simes’s test used for each intersection hypothesis H _I at level α . See Hommel (1988) for details.

The procedure is as follows: reject all hypotheses that have a p-value $\le \alpha/j$, where j is defined as

$$j = \max\left\{\, i \in \{1, \ldots, n\} : p_{(n-i+k)} > \frac{k\alpha}{i} \ \text{ for } k = 1, \ldots, i \,\right\}.$$

If no maximum exists, all hypotheses are rejected (the largest p-value is then smaller than $\alpha$).

Hommel’s procedure is more powerful than Hochberg’s, but it requires a little more computation. Consider the following example, which illustrates that Hommel’s procedure can reject more than Hochberg’s:

Suppose that we have 3 hypotheses $H_1, H_2, H_3$, that the p-values are $p_{(1)} = 0.024$, $p_{(2)} = 0.030$ and $p_{(3)} = 0.073$, and that $\alpha = 0.05$. With Hommel’s procedure we first calculate j:

For i=1: $p_{(3)} = 0.073 > \alpha = 0.05$

For i=2: $p_{(3)} = 0.073 > \alpha = 0.05$, $p_{(2)} = 0.030 > \alpha/2 = 0.025$

For i=3: $p_{(3)} = 0.073 > \alpha = 0.05$, $p_{(2)} = 0.030 < 2\alpha/3 = 0.033$, $p_{(1)} = 0.024 > \alpha/3 = 0.0167$

So in this example $j = \max\{1, 2\} = 2$ and all hypotheses with a p-value $\le \alpha/2 = 0.025$ are rejected. The hypothesis with p-value $p_{(1)} = 0.024$ is then rejected by Hommel’s procedure.

If Hochberg’s procedure were used instead, no hypotheses would be rejected: $p_{(3)} = 0.073$ is larger than $\alpha = 0.05$, $p_{(2)} = 0.030$ is larger than $\alpha/2 = 0.025$ and $p_{(1)} = 0.024$ is larger than $\alpha/3 = 0.0167$.

Hommel’s procedure can be improved by using logical relations between the hypotheses $H_1, H_2, \ldots, H_n$ in a manner that is analogous to Shaffer’s improvement of the Bonferroni-Holm procedure; see Hommel (1988).
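The definition of j and the worked example can be reproduced with the following sketch. The function name `hommel` and its return convention (the decisions together with j, or `None` for j when no maximum exists) are assumptions of this sketch.

```python
from typing import List, Optional, Sequence, Tuple


def hommel(p_values: Sequence[float],
           alpha: float = 0.05) -> Tuple[List[bool], Optional[int]]:
    """Hommel's procedure as described above.

    j is the largest i in {1, ..., n} such that
    p_(n-i+k) > k * alpha / i for all k = 1, ..., i.
    """
    n = len(p_values)
    ordered = sorted(p_values)  # ordered[m - 1] is p_(m)
    candidates = [
        i for i in range(1, n + 1)
        if all(ordered[n - i + k - 1] > k * alpha / i for k in range(1, i + 1))
    ]
    if not candidates:
        # No maximum exists: all hypotheses are rejected.
        return [True] * n, None
    j = max(candidates)
    return [p <= alpha / j for p in p_values], j


if __name__ == "__main__":
    decisions, j = hommel([0.024, 0.030, 0.073])
    print(j, decisions)
```

For the p-values of the example the sketch gives j = 2 and rejects only the hypothesis with p-value 0.024, in agreement with the calculation above.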

6. Power and Sample size

It is assumed in the following considerations that a certain target subset, $H_i$, $i \in T$, of the hypotheses $H_1, \ldots, H_n$ in the family considered are false in a specified way. In these specifications of the false hypotheses, assumptions have to be made concerning distributions, parameter-values, etc.

As mentioned in section 2.4, there are three common definitions of power, numbered (1), (2), (3) there. Given any of these three power-concepts, and given the MCP to be used, the power can in principle be determined for a given experimental design and given parameter values (including given sample sizes). The sample size problem then consists in varying (increasing) the sample sizes until this power has reached a desired level, typically around 0.90.

The sketch just given is now illustrated in more detail in two examples.

6.1 Illustration 1: Bonferroni and Power-concept (3)

Consider example 2.2 in section 2.1 with three hypotheses $H_1, H_2, H_3$ concerning the differences

$$\Delta_1 = \mu_A - \mu_B, \qquad \Delta_2 = \mu_A - \mu_C, \qquad \Delta_3 = \mu_B - \mu_C$$

between population means $\mu_A$, $\mu_B$ and $\mu_C$. Hypothesis $H_i$ states that $\Delta_i \ge 0$. We assume that N Y-responses are available from each of three underlying normal distributions, $N(\mu_A, \sigma_A^2)$, $N(\mu_B, \sigma_B^2)$, $N(\mu_C, \sigma_C^2)$. The population variances are unknown and possibly distinct.

Suppose now that the MCP to be used is the ordinary Bonferroni procedure, with $H_i: \Delta_i \ge 0$ to be tested at marginal level $\alpha/3$ through the test statistics

$$t_1 = \frac{\bar{Y}_A - \bar{Y}_B}{\sqrt{s_A^2/N + s_B^2/N}}, \qquad t_2 = \frac{\bar{Y}_A - \bar{Y}_C}{\sqrt{s_A^2/N + s_C^2/N}}, \qquad t_3 = \frac{\bar{Y}_B - \bar{Y}_C}{\sqrt{s_B^2/N + s_C^2/N}},$$

with obvious notation, and $\alpha$ set equal to 0.05. In these tests, each of these test statistics can be treated as an ordinary t-statistic with $N - 1$ degrees of freedom: $H_i: \Delta_i \ge 0$ is rejected if the observed value $t_{i\text{-obs}}$ of $t_i$ is $\le t_{\alpha/3,\, N-1}$, with obvious notation. This ensures that each $H_i$ is tested at marginal significance level $\le \alpha/3$. The relevant marginal p-value $p_i$ corresponding to an observed value $t_{i\text{-obs}}$ is then given by

$$p_i = \Pr[\, t_{i,0} \le t_{i\text{-obs}} \mid H_i \text{ is true} \,],$$

where $t_{i,0}$ denotes a random variable distributed according to a central t-distribution with $N - 1$ degrees of freedom.

Suppose also that we are interested in power-concept (3), which means that all false hypotheses in the target T of false hypotheses are to be rejected. Here the target “subset” T of false hypotheses of interest is taken to consist of the entire family of three hypotheses considered. Thus all $H_1, H_2, H_3$ are assumed to be false, and we are interested in the power of rejecting all these three hypotheses $H_1, H_2, H_3$ with the Bonferroni procedure (which has FWER $\le \alpha$). The following anticipated values are used:

$$\Delta_1 = \Delta_1^*, \qquad \Delta_2 = \Delta_2^*, \qquad \Delta_3 = \Delta_3^*,$$

and $\sigma_A^2 = \sigma_A^{*2}$, $\sigma_B^2 = \sigma_B^{*2}$, $\sigma_C^2 = \sigma_C^{*2}$, where an asterisk indicates anticipated numerical values of the unknown parameters. The specified values $\Delta_1^*, \Delta_2^*, \Delta_3^*$ are all $< 0$, because they correspond to alternative hypotheses.

Given these anticipated values and any given N, the power can now in principle be determined. This power is equal to

$$\text{Power} \equiv \Pr[\, H_1 \text{ and } H_2 \text{ and } H_3 \text{ are rejected by B} \,],$$

where B indicates the Bonferroni procedure based on the $t_i$-statistics or the $p_i$-values as just described. In terms of the marginal p-values $p_1, p_2, p_3$ of the three tests, this power equals

$$\Pr[\, p_1 \le \alpha/3 \text{ and } p_2 \le \alpha/3 \text{ and } p_3 \le \alpha/3 \,].$$

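By using Bonferroni’s inequality this power can be bounded from below as follows:

$$\begin{aligned} \text{Power} &\ge 1 - \Pr[p_1 > \alpha/3] - \Pr[p_2 > \alpha/3] - \Pr[p_3 > \alpha/3] \\ &= \Pr[p_1 \le \alpha/3] + \Pr[p_2 \le \alpha/3] + \Pr[p_3 \le \alpha/3] - 2 \equiv \text{Power lower bound}. \end{aligned}$$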

Each probability $\Pr[H_i \text{ rejected by its marginal test}] = \Pr[p_i \le \alpha/3] = \Pr[t_i \le t_{\alpha/3,\, N-1}]$ can be approximated through a large-N normal approximation in terms of the standard normal distribution function $\Phi$ as follows:

$$\Pr[p_1 \le \alpha/3] = \Pr[t_1 \le t_{\alpha/3,\, N-1}] \approx \Phi\!\left( z_{\alpha/3} - \Delta_1^* \Big/ \sqrt{(\sigma_A^{*2} + \sigma_B^{*2})/N} \right)$$

$$\Pr[p_2 \le \alpha/3] = \Pr[t_2 \le t_{\alpha/3,\, N-1}] \approx \Phi\!\left( z_{\alpha/3} - \Delta_2^* \Big/ \sqrt{(\sigma_A^{*2} + \sigma_C^{*2})/N} \right)$$

$$\Pr[p_3 \le \alpha/3] = \Pr[t_3 \le t_{\alpha/3,\, N-1}] \approx \Phi\!\left( z_{\alpha/3} - \Delta_3^* \Big/ \sqrt{(\sigma_B^{*2} + \sigma_C^{*2})/N} \right)$$

where $z_{\alpha/3} = \Phi^{-1}(\alpha/3)$ is the $\alpha/3$ quantile of the standard normal distribution. The Power lower bound just derived can thus easily be approximated in terms of these expressions.

The problem is now to calculate this lower bound for varying (increasing) values of N until a desired level is reached. A SAS program is given in appendix 2, where the anticipated values $\Delta_1^* = -2$, $\Delta_2^* = -4$, $\Delta_3^* = -2$ (corresponding to $\mu_A^* = 3$, $\mu_B^* = 5$, $\mu_C^* = 7$), and $\sigma_A^* = 2.0$, $\sigma_B^* = 2.5$, $\sigma_C^* = 3.0$ are used. The following table is obtained:

| N | Power lower bound |
|---|---|
| 45 | 0.88491 |
| 46 | 0.89329 |
| 47 | 0.90108 |
| 48 | 0.90832 |

Thus if a guaranteed power of at least 0.90 is desired, then according to these calculations, one should use N equal to 47 if Bonferroni’s procedure is to be used.

Instead of the large-N normal approximation used here, one may evaluate the probabilities exactly through the non-central t-distribution, for example using relevant SAS functions.
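The calculation can also be sketched in Python instead of SAS. The sketch below is not the program of appendix 2; it only re-implements the normal-approximation formulas of this section with the anticipated values stated above, using the standard-library class `statistics.NormalDist`. All function and variable names are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Anticipated values from the text: standard deviations and differences.
SD = {"A": 2.0, "B": 2.5, "C": 3.0}
COMPARISONS = [(-2.0, "A", "B"), (-4.0, "A", "C"), (-2.0, "B", "C")]


def marginal_power(delta: float, var_sum: float, n: int, level: float) -> float:
    """Normal approximation Phi(z_level - delta / sqrt(var_sum / n))."""
    z = STD_NORMAL.inv_cdf(level)
    return STD_NORMAL.cdf(z - delta / sqrt(var_sum / n))


def bonferroni_power_lower_bound(n: int, alpha: float = 0.05) -> float:
    """Sum of the three marginal powers at level alpha/3, minus 2."""
    powers = [
        marginal_power(delta, SD[x] ** 2 + SD[y] ** 2, n, alpha / 3)
        for delta, x, y in COMPARISONS
    ]
    return sum(powers) - 2.0


def smallest_n(target: float = 0.90, alpha: float = 0.05) -> int:
    """Increase N until the lower bound reaches the target power."""
    n = 2
    while bonferroni_power_lower_bound(n, alpha) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in (45, 46, 47, 48):
        print(n, round(bonferroni_power_lower_bound(n), 5))
    print("smallest N:", smallest_n())
```

The loop in `smallest_n` terminates because each marginal power increases with N when the anticipated differences are negative.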

6.2 Illustration 2: Bonferroni-Holm and Power-concept (3)

Consider again example 2.2 with three hypotheses $H_1, H_2, H_3$. The same assumptions and notation as in section 6.1 are used.

When Bonferroni was used in 6.1, all p-values $p_i$ were compared to the same level, $\alpha/3$. Here Bonferroni-Holm is used and the p-values are compared to different levels: the smallest is compared with $\alpha/3$, the second smallest with $\alpha/2$ and the largest with $\alpha$.

The power of type (3) of rejecting all three false hypotheses with the Bonferroni-Holm procedure is

$$\text{Power} \equiv \Pr[\, H_1 \text{ and } H_2 \text{ and } H_3 \text{ are rejected by BH} \,],$$

where BH indicates the Bonferroni-Holm procedure. In terms of the ordered marginal p-values $p_{(1)} \le p_{(2)} \le p_{(3)}$ of the three tests, this power equals

$$\Pr[\, p_{(1)} \le \alpha/3 \text{ and } p_{(2)} \le \alpha/2 \text{ and } p_{(3)} \le \alpha/1 \,].$$

Now, let $(i_1, i_2, i_3)$ be any given permutation of the three integers $(1, 2, 3)$. The following inequality then holds,

$$\Pr[\, p_{(1)} \le \alpha/3 \text{ and } p_{(2)} \le \alpha/2 \text{ and } p_{(3)} \le \alpha/1 \,] \ge \Pr[\, p_{i_1} \le \alpha/3 \text{ and } p_{i_2} \le \alpha/2 \text{ and } p_{i_3} \le \alpha/1 \,]$$

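because if the event in the right-hand probability occurs, then the event in the left-hand probability occurs as well: in that case $p_{(1)} \le p_{i_1} \le \alpha/3$, $p_{(2)} \le \max(p_{i_1}, p_{i_2}) \le \alpha/2$, and $p_{(3)} \le \alpha/1$. The right-hand probability can itself be bounded from below through Bonferroni's inequality, which leads to the inequality

$$\text{Power} \ge \Pr[p_{i_1} \le \alpha/3] + \Pr[p_{i_2} \le \alpha/2] + \Pr[p_{i_3} \le \alpha/1] - 2.$$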
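For completeness, the form of Bonferroni's inequality used in this step can be spelled out. In the short sketch below, $A_1, A_2, A_3$ are shorthand introduced here (not notation from the original text) for the three events $\{p_{i_1} \le \alpha/3\}$, $\{p_{i_2} \le \alpha/2\}$ and $\{p_{i_3} \le \alpha/1\}$, and $A_k^c$ denotes the complement of $A_k$:

```latex
\Pr[A_1 \cap A_2 \cap A_3]
  = 1 - \Pr[A_1^c \cup A_2^c \cup A_3^c]
  \ge 1 - \sum_{k=1}^{3} \Pr[A_k^c]
  = 1 - \sum_{k=1}^{3} \bigl(1 - \Pr[A_k]\bigr)
  = \sum_{k=1}^{3} \Pr[A_k] - 2 .
```

No assumption about the dependence between the three tests is needed for this bound, which is why it can be evaluated from the marginal powers alone.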

Moreover, since this latter inequality holds for all permutations $(i_1, i_2, i_3)$ of $(1, 2, 3)$, it also holds with the right member replaced by the maximum over all such permutations. This means that

$$\text{Power} \ge \max_{(i_1, i_2, i_3)} \left( \Pr[p_{i_1} \le \alpha/3] + \Pr[p_{i_2} \le \alpha/2] + \Pr[p_{i_3} \le \alpha/1] - 2 \right) \equiv \text{Power lower bound},$$

where the maximum is taken over all permutations $(i_1, i_2, i_3)$ of $(1, 2, 3)$. The power for each marginal test involved in this expression,

$$\Pr[H_{i_1} \text{ rejected}] \equiv \Pr[p_{i_1} \le \alpha/3] = \Pr[t_{i_1} \le t_{\alpha/3,\,N-1}],$$

$$\Pr[H_{i_2} \text{ rejected}] \equiv \Pr[p_{i_2} \le \alpha/2] = \Pr[t_{i_2} \le t_{\alpha/2,\,N-1}],$$

$$\Pr[H_{i_3} \text{ rejected}] \equiv \Pr[p_{i_3} \le \alpha/1] = \Pr[t_{i_3} \le t_{\alpha/1,\,N-1}],$$

can be approximated through a normal approximation as in section 6.1, with $\alpha/3$ replaced as appropriate by $\alpha/3$, $\alpha/2$ or $\alpha/1$. This leads to an expression for the Power lower bound that can be easily programmed.

With 3 hypotheses there are 3! = 6 different permutations. The table below gives the level of significance at which each hypothesis is tested under each of the 6 permutations.

Test of   Permutation 1   Permutation 2   Permutation 3   Permutation 4   Permutation 5   Permutation 6
H1        α               α               α/2             α/2             α/3             α/3
H2        α/2             α/3             α               α/3             α               α/2
H3        α/3             α/2             α/3             α               α/2             α


For example, for permutation 4, $H_1$ will be tested at level $\alpha/2$, $H_2$ will be tested at level $\alpha/3$ and $H_3$ will be tested at level $\alpha$. The power for each marginal test in permutation 4 can thus be approximated as follows:

$$\Pr[p_1 \le \alpha/2] = \Pr[t_1 \le t_{\alpha/2,\,N-1}] \approx \Phi\left( z_{\alpha/2} - \Delta_1^* \Big/ \sqrt{(\sigma_A^{*2} + \sigma_B^{*2})/N} \right)$$

$$\Pr[p_2 \le \alpha/3] = \Pr[t_2 \le t_{\alpha/3,\,N-1}] \approx \Phi\left( z_{\alpha/3} - \Delta_2^* \Big/ \sqrt{(\sigma_A^{*2} + \sigma_C^{*2})/N} \right)$$

$$\Pr[p_3 \le \alpha] = \Pr[t_3 \le t_{\alpha,\,N-1}] \approx \Phi\left( z_{\alpha} - \Delta_3^* \Big/ \sqrt{(\sigma_B^{*2} + \sigma_C^{*2})/N} \right).$$

The Power lower bound can then be approximated through similar expressions for the other permutations, and one then only has to increase N until the desired level is reached.
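The maximisation over the six permutations can be sketched as follows. This Python snippet is a minimal sketch under the same assumptions as section 6.1 (normal approximation, the anticipated values used in the illustrations, α = 0.05); it is not the SAS program of Appendix 3, and the function names are illustrative. Iterating over `itertools.permutations((1, 2, 3))` assigns the divisors of α to $H_1, H_2, H_3$ in the same order as the columns of the permutation table.

```python
from itertools import permutations
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

# Anticipated values (assumed inputs for this sketch).
DELTAS = (-2.0, -4.0, -2.0)            # Delta_1*, Delta_2*, Delta_3*
VARIANCE_SUMS = (2.0**2 + 2.5**2,      # sigma_A*^2 + sigma_B*^2
                 2.0**2 + 3.0**2,      # sigma_A*^2 + sigma_C*^2
                 2.5**2 + 3.0**2)      # sigma_B*^2 + sigma_C*^2


def marginal_power(k, level, n):
    """Normal approximation of Pr[p_k <= level] for hypothesis k (0-based)."""
    z_crit = STD_NORMAL.inv_cdf(level)
    return STD_NORMAL.cdf(z_crit - DELTAS[k] / sqrt(VARIANCE_SUMS[k] / n))


def holm_power_lower_bound(n, alpha=0.05):
    """Maximum over all permutations of the Bonferroni-type bound."""
    best = float("-inf")
    # divisors[k] is the divisor of alpha used when testing hypothesis k.
    for divisors in permutations((1, 2, 3)):
        bound = sum(marginal_power(k, alpha / d, n)
                    for k, d in enumerate(divisors)) - 2
        best = max(best, bound)
    return best


def smallest_n(target=0.90, n_max=200):
    """Smallest N whose lower bound reaches the target, or None if not found."""
    for n in range(1, n_max + 1):
        if holm_power_lower_bound(n) >= target:
            return n
    return None


if __name__ == "__main__":
    for n in range(36, 40):
        print(n, round(holm_power_lower_bound(n), 5))
    print("smallest N:", smallest_n())
```

Taking the maximum means that the least powerful marginal test is, in effect, paired with the most lenient level, which is what makes this bound larger than the Bonferroni bound for the same N.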

A SAS program is given in Appendix 3, where the anticipated values $\Delta_1^* = -2$, $\Delta_2^* = -4$, $\Delta_3^* = -2$ (corresponding to $\mu_A^* = 3$, $\mu_B^* = 5$, $\mu_C^* = 7$) and $\sigma_A^* = 2.0$, $\sigma_B^* = 2.5$, $\sigma_C^* = 3.0$ are used. The tests are almost the same as for the Bonferroni procedure in section 6.1; the only difference is that different z-values are used. Running the SAS program gives the following table:

N     Power lower bound
36    0.88648
37    0.89638
38    0.90544
39    0.91373

Thus if a guaranteed power of at least 0.90 is desired, then according to these calculations, one should use N equal to 38 if Bonferroni-Holm’s procedure is to be used.

Comparing with section 6.1, the same situation requires a considerably larger sample size N with the Bonferroni procedure than with the Bonferroni-Holm procedure: 47 instead of 38.

Again, instead of the large-N approximation used, one may evaluate the probabilities exactly through the non-central t-distribution.


Appendix 1 Proof of strong control for the fixed sequence procedure (FS)

Two cases are possible: (i) either all hypotheses are false, or (ii) there is at least one true hypothesis. These two cases are treated separately.

First case: Suppose no hypothesis $H_j$ in the family $H_1, H_2, \ldots, H_n$ is true. Then no true $H_j$ can be rejected with FS, and the probability of making a type I error is zero:

$$\Pr[\text{type I error with FS}] = 0.$$


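Second case: Suppose at least one hypothesis $H_j$ in the family $H_1, H_2, \ldots, H_n$ is true. Let f be the index of the first true hypothesis in the testing order. Of course this f is unknown. Then

$$[\text{type I error with FS}] \equiv [\text{at least one true } H_j \text{ is rejected with FS}] = [H_f \text{ is rejected with FS}] \subset [H_f \text{ has p-value } p_f \le \alpha].$$

The equality holds because FS stops at the first hypothesis that is not rejected, so a true hypothesis later in the sequence can be rejected only if $H_f$ has been rejected first.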

This means that the probability of making a type I error satisfies

$$\Pr[\text{type I error with FS}] \le \Pr[H_f \text{ has p-value } p_f \le \alpha] \le \alpha.$$

This holds no matter how many or which of the hypotheses $H_1, H_2, \ldots, H_n$ are true, which means that the type I FWER is strongly controlled to be $\le \alpha$.
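The stopping rule that drives this proof can be made concrete. The following Python snippet is a minimal sketch that assumes the fixed sequence procedure in its usual form (hypotheses tested in their prespecified order, each at the full level α, stopping at the first p-value that exceeds α); the function name and the example p-values are illustrative.

```python
def fixed_sequence(p_values, alpha=0.05):
    """p_values are listed in the prespecified testing order.

    Returns a list of booleans: True where the hypothesis is rejected.
    """
    rejected = []
    stopped = False
    for p in p_values:
        if not stopped and p <= alpha:
            rejected.append(True)
        else:
            stopped = True          # the first non-rejection stops the procedure
            rejected.append(False)
    return rejected


if __name__ == "__main__":
    print(fixed_sequence([0.01, 0.20, 0.001]))
```

In this example the third hypothesis is not rejected even though its p-value is small, because the procedure has already stopped at the second one. This is the property used in the second case of the proof: a type I error can occur only if the first true hypothesis is itself rejected.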

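Appendix 2

data power(keep=N Power);
  /* Bonferroni: every test uses the same critical value z_(alpha/3) */
  alpha=0.05; zcrit=probit(alpha/3);
  /* Anticipated means, differences and standard deviations */
  mya=3; myb=5; myc=7;
  d1=mya-myb; d2=mya-myc; d3=myb-myc;
  sigmA=2; sigmB=2.5; sigmC=3;
  do N=1 to 50 by 1;
    /* Normal approximation of the power of each marginal test */
    phi1=probnorm(zcrit-d1/sqrt((sigmA**2+sigmB**2)/N));
    phi2=probnorm(zcrit-d2/sqrt((sigmA**2+sigmC**2)/N));
    phi3=probnorm(zcrit-d3/sqrt((sigmB**2+sigmC**2)/N));
    /* Bonferroni lower bound for the power of rejecting all three */
    Power=phi1+phi2+phi3-2;
    output;
  end;
run;
proc print data=power; run;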

Appendix 3

data power(keep=N Power);
  /* Critical values for the levels alpha, alpha/2 and alpha/3 */
  alpha=0.05; zcrit1=probit(alpha); zcrit2=probit(alpha/2); zcrit3=probit(alpha/3);
  /* Anticipated means, differences and standard deviations */
  mya=3; myb=5; myc=7;
  d1=mya-myb; d2=mya-myc; d3=myb-myc;
  sigmA=2; sigmB=2.5; sigmC=3;
  do N=1 to 50 by 1;
    /* Permutation 1: H1 at alpha, H2 at alpha/2, H3 at alpha/3 */
    phi11=probnorm(zcrit1-d1/sqrt((sigmA**2+sigmB**2)/N));
    phi12=probnorm(zcrit2-d2/sqrt((sigmA**2+sigmC**2)/N));
    phi13=probnorm(zcrit3-d3/sqrt((sigmB**2+sigmC**2)/N));
    Power1=phi11+phi12+phi13-2;
    /* Permutation 2: H1 at alpha, H2 at alpha/3, H3 at alpha/2 */
    phi21=probnorm(zcrit1-d1/sqrt((sigmA**2+sigmB**2)/N));
    phi22=probnorm(zcrit3-d2/sqrt((sigmA**2+sigmC**2)/N));
    phi23=probnorm(zcrit2-d3/sqrt((sigmB**2+sigmC**2)/N));
    Power2=phi21+phi22+phi23-2;
    /* Permutation 3: H1 at alpha/2, H2 at alpha, H3 at alpha/3 */
    phi31=probnorm(zcrit2-d1/sqrt((sigmA**2+sigmB**2)/N));
    phi32=probnorm(zcrit1-d2/sqrt((sigmA**2+sigmC**2)/N));
    phi33=probnorm(zcrit3-d3/sqrt((sigmB**2+sigmC**2)/N));
    Power3=phi31+phi32+phi33-2;
    /* Permutation 4: H1 at alpha/2, H2 at alpha/3, H3 at alpha */
    phi41=probnorm(zcrit2-d1/sqrt((sigmA**2+sigmB**2)/N));
    phi42=probnorm(zcrit3-d2/sqrt((sigmA**2+sigmC**2)/N));
    phi43=probnorm(zcrit1-d3/sqrt((sigmB**2+sigmC**2)/N));
    Power4=phi41+phi42+phi43-2;
    /* Permutation 5: H1 at alpha/3, H2 at alpha, H3 at alpha/2 */
    phi51=probnorm(zcrit3-d1/sqrt((sigmA**2+sigmB**2)/N));
    phi52=probnorm(zcrit1-d2/sqrt((sigmA**2+sigmC**2)/N));
    phi53=probnorm(zcrit2-d3/sqrt((sigmB**2+sigmC**2)/N));
    Power5=phi51+phi52+phi53-2;
    /* Permutation 6: H1 at alpha/3, H2 at alpha/2, H3 at alpha */
    phi61=probnorm(zcrit3-d1/sqrt((sigmA**2+sigmB**2)/N));
    phi62=probnorm(zcrit2-d2/sqrt((sigmA**2+sigmC**2)/N));
    phi63=probnorm(zcrit1-d3/sqrt((sigmB**2+sigmC**2)/N));
    Power6=phi61+phi62+phi63-2;
    /* Power lower bound: maximum over the six permutations */
    Power=max(Power1,Power2,Power3,Power4,Power5,Power6);
    output;
  end;
run;
proc print data=power;
run;

References

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800-802.

Hochberg, Y. & Tamhane, A. (1987). Multiple comparison procedures. Wiley, New York.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scand. J. Statist. 6, 65-70.

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75, 383-386.

Hsu, J. C. (1996). Multiple comparisons. Chapman & Hall, London.


Marcus, R., Peritz, E. & Gabriel, K. R. (1976). On closed test procedures with special reference to ordered analysis of variance. Biometrika 63, 655-660.

Sarkar, S. K. (1998). Some probability inequalities for ordered MTP₂ random variables: A proof of the Simes conjecture. The Annals of Statistics 26, 494-504.

Sarkar, S. K. & Chang, C.-K. (1997). The Simes method for multiple hypothesis testing with positively dependent test statistics. J. Am. Stat. Assoc. 92, 1601-1608.

Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73, 751-754.

Shaffer, J. (1995). Multiple hypothesis testing. Annu. Rev. Psychol. 46, 561-584.

Shaffer, J. (1986). Modified sequentially rejective multiple test procedures. J. Am. Stat. Assoc. 81, 826-831.

Westfall, P. & Young, S. (1993). Resampling-based multiple testing. Wiley, New York.

Wiens, B. (2003). A fixed sequence Bonferroni procedure for testing multiple endpoints. Pharmaceut. Statist. 2, 211-215.


