
The Two-Sample t-test and the Influence of Outliers

– A simulation study on how the type I error rate is impacted by outliers of different magnitude.

Bachelor’s thesis

Department of Statistics

Uppsala University

Date: 2019-01-15

Author: Carl Widerberg


Abstract

This study investigates how outliers of different magnitude impact the robustness of the two-sample t-test. A simulation study approach is used to analyze the behavior of type I error rates when outliers are added to generated data. Outliers may distort parameter estimates such as the mean and variance and cause misleading test results. Previous research has shown that Welch’s t-test performs better than the traditional Student’s t-test when group variances are unequal. Therefore, these two alternative statistics are compared in terms of type I error rates when outliers are added to the samples. The results show that control of type I error rates can be maintained in the presence of a single outlier. Depending on the magnitude of the outlier and the sample size, there are scenarios where the t-test is robust. However, the sensitivity of the t-test is illustrated by deteriorating type I error rates when more than one outlier is included. The comparison between Welch’s t-test and Student’s t-test shows that the former is marginally more robust against outlier influence.

Keywords


Table of contents

1. Introduction
2. Theory and methodology
   2.1 What is an outlier?
   2.2 Methodology
      2.2.1 ANOVA
      2.2.2 Student’s t-test
      2.2.3 Welch’s t-test
3. Simulation study
   3.1 Sample size
   3.2 Outlier values
   3.3 Simulation setup
4. Results
   4.1 Small sample size
   4.2 Medium sample size
   4.3 Large sample size
5. Conclusion
References
Appendix A. Additional Simulations
   Alternative 1
   Alternative 2
   Alternative 3


1. Introduction

The wide use of Analysis of Variance (ANOVA) for inference may stem from the applicability and the relevance of comparing means between groups. One-way ANOVA is a straightforward approach and yields interpretable results, enabling the researcher to make inferences about population means based on sample means.

The aptness, however, is dependent on a number of assumptions being met. As for all parametric tests, parameters and estimates are a foundation of ANOVA. This creates a sensitivity towards non-normality, both in parameters and errors, and towards unequal variances. These assumptions need to be met if the ANOVA is to be considered reliable (Ramsey et al., 2011). There has been a fair amount of academic interest in these assumptions and in how violations affect results. Methods have been developed to provide researchers with more robust procedures using trimming and bootstrapping of data (Keselman et al., 2004). However, Ramsey et al. (2011) argue that these “super-robust” methods have problems dealing with outliers. Trimming, for example, sacrifices large portions of data to remove outliers, yet outliers might contain important information. While academia has long been aware of the lack of robustness and the potentially skewed results that outliers can cause in ANOVA, there is not a great amount of published academic research on their actual effect.


Stock & Watson (2015), for example, include among their assumptions for regression analysis that large outliers are unlikely, formalized as a requirement that kurtosis is finite (p.173). The reason behind this view is the risk of misleading results due to large outliers.

Overall there is little opposition towards the view that standard ANOVA techniques are sensitive to outliers. But the method of simply deleting deviating values without further consideration is opposed by many researchers. Sheskin (2000) states that the presence of one or several outliers can have a substantial effect on both the mean and the variance of a distribution. If the variance is impacted by outliers, test statistics might not be reliable. Zimmerman & Zumbo (2009) discuss how the pooling of variances done in ANOVA distorts the probability of making a type I error when variances are unequal. There are alternative tests available, such as the test presented by Welch (1938), which does not pool variances. Research has shown that Welch’s test can manage unequal variances (Delacre et al., 2017), but there is less evidence about how it performs when outliers are included.

In his bachelor’s thesis, Halldestam (2016) uses a simulation study to examine the robustness of the parameter estimate in one-way ANOVA when outliers are present. He finds that outliers increase the type I error probability, concluding that the parameter estimates are not robust. However, the outliers in his study are fixed to a single, rather extreme value. There may be opportunities to deepen the understanding by investigating outliers of varying magnitudes. Expanding the evidence may provide researchers with more guidance when they are faced with outliers in analysis of variance, potentially decreasing the loss of valuable information stemming from deletion of outliers.

The purpose of this study is therefore to investigate how outliers of different magnitudes influence the special case of one-way ANOVA known as the two-sample t-test. The research questions are:

What is the influence of outliers of different magnitude on the type I error rate in the two-sample t-test? Are there any differences in robustness between Welch’s t-test and Student’s t-test when outliers are present?


2. Theory and methodology

In practice, assumptions related to statistical tests are rarely perfectly satisfied. It is therefore important to know whether a statistical method is robust to particular violations of assumptions (Agresti & Finlay, 2014, p. 122). For this purpose a simulation study is arguably suitable. The advantages of simulation studies have been clear since the 1930s, as they give the researcher the ability to control the underlying data (Stigler, 1977). This study turns the focus towards outliers, and the simulation approach enables control of the distribution of the data, since it is generated through a computer program instead of collected. Therefore, outliers can be added to samples drawn from a known normal distribution, which makes it easier to capture the effect of outliers without uncertainty regarding the distribution of the underlying population.

However, it is appropriate to first establish a foundation regarding what “outlier” means in practice.

2.1 What is an outlier?

There is no universal definition of what actually constitutes an outlier. Barnett & Lewis (1984:4) provide the following definition in their book titled “Outliers in Statistical Data”.

“An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.”

The authors discuss the wording “appears to be inconsistent”, as it means that the definition of an outlier is dependent on subjective interpretation. This view is shared by other researchers (Hampel et al., 1986; Sheskin, 2000).

Tukey (1977) offers a more operational classification based on the boxplot. The inner fences are placed 1,5 times the hinge-spread (the interquartile range) out from the box, and the outer fences are 3 times the hinge-spread out from the box. The observations inside the inner fences are categorized as adjacent. Observations between the inner and outer fences are categorized as outside values, and values outside the outer fences are categorized as far out (Tukey, 1977, 44).

The separation of far out values (extreme outliers) and outside values (outliers) is suitable for this study, since the aim is to investigate the influence of outliers of varying magnitude. In contrast, Ramsey et al. (2011) and Halldestam (2016) both use outliers which are significantly beyond the outer fences: Halldestam (2016) sets outliers to approximately two times the outer fence, and Ramsey et al. (2011) set outliers to five standard deviations from the mean. In practice, researchers can be faced with both extreme outliers and outliers that are marginally beyond the fences. Researchers may feel more comfortable dealing with extreme outliers, where the effects on p-values are known to be substantial, than with outliers of varying magnitude, where the influence on the results of the analysis is more opaque. This study will contribute to the understanding of how different kinds of outliers influence ANOVA.

2.2 Methodology

The methodological foundation is Analysis of Variance. In this study, the main focus is on the special case of ANOVA known as the two-sample t-test. As the name implies, the analysis concerns the difference in means between two populations, based on group sample means. For this purpose, both Student’s t-test and its modification, Welch’s t-test, are applied. The reason for applying two alternative tests is that the addition of outliers causes the group variances to deviate, which is a violation of one of the assumptions of ANOVA. It has been shown that Welch’s t is more robust in cases of unequal variances (Delacre et al., 2017).

In this section, the aforementioned methods are presented. First, the general case of ANOVA is visited, followed by the two special-case tests, Student’s t-test and Welch’s t-test.

2.2.1 ANOVA

The motive behind using ANOVA in research is to find out whether there are statistically significant differences between group means. Suppose, for example, that crop yields are compared between two fertilizer types. The group sample means may indicate differences between the two fertilizer types. However, it is not suitable to make any general remarks by only looking at the group sample means. For generalization to the population, one needs to draw inference by investigating whether the means in the sample are representative of the means in the population, taking into account the random variation of the sample means. ANOVA is used to accomplish this goal.

The standard formula for the one-way ANOVA is:

Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, \dots, k; \; j = 1, \dots, n_i

where \mu is the overall mean, \alpha_i is the differential effect of the ith treatment, and \varepsilon_{ij} is the error term (Scariano & Davenport, 1987).

The aim of ANOVA is to find out whether there are any differences between group means. To do that, ANOVA makes use of the variability in the data. This variability quantifies the spread of the individual observations around the mean. The variability is made up of two quantities, which in ANOVA are kept separate. The first quantity is the sum of squares, and the other is the degrees of freedom associated with the sum of squares. The degrees of freedom are interpreted as the number of independent pieces of information that contribute to a statistic, in this case the variability around the mean.

The first step is to calculate a grand mean of the total sample. The next step is to calculate group means for each group. If there are differences between the group means, fitting the group means will explain much of the variability around the grand mean; if there are no differences between the groups, the group means will remove little of that variability. So at what threshold does the ANOVA reject the null hypothesis of no significant difference between the groups? How much of the variability around the grand mean must be removed by the fitting of the group means to obtain a significant result? To answer this, the ANOVA statistic needs to be broken down into its mathematical components. The breakdown separates the variability between the groups from the variability within the groups. The between-group sum of squares (SSB) is calculated by squaring the deviation of each group mean from the grand mean, weighting by the group size, and summing over the groups:

SSB = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x}_{grand})^2

where k is the number of groups and n_j is the number of observations in group j.

The sum of squares within groups (SSW) is calculated by squaring the deviation of each observation from its group mean and pooling over groups. This gives a measure of the variation within each of the groups:

SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2

The SSB and the SSW summed together give, logically, the total sum of squares (SST): the sum of squared deviations of each observation around the grand mean.

SST = \sum_{i=1}^{n} (x_i - \bar{x}_{grand})^2

or, equivalently, SST = SSB + SSW.

At this point, it is possible to analyze the variability between the group means and see if there are any distinguishable differences. However, for a thorough statistical comparison it is not enough to only look at the variability. The variability needs to be adjusted for the degrees of freedom. In doing this, the sum of squares is transformed into variance.

Starting with the variance between groups, the adjustment uses the number of groups, that is, k − 1 degrees of freedom, which gives the Mean Square between groups:

MSB = \frac{SSB}{k - 1} = \frac{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x}_{grand})^2}{k - 1}

A similar adjustment is made for SSW, but here n − k degrees of freedom are used to adjust for the total sample size. This gives the Mean Square within groups:

MSW = \frac{SSW}{n - k} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{n - k}

The total variance in the data, also called the Total Mean Square (MST), is calculated by combining the total sum of squares with its degrees of freedom, here n − 1:

MST = \frac{SST}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_{grand})^2}{n - 1}

After obtaining variances, the next step is to formally test whether the differences are statistically significant. For ANOVA with more than two groups, this is done through the F-ratio. If there are no differences between the groups, the variation between the groups would be similar to the variation within the groups. Thus the F-ratio is calculated as:

F = \frac{MSB}{MSW}

An F-ratio above 1 suggests that the group means are different because the variance between the groups is larger than the variance within the groups. If the F-ratio is below 1, the interpretation is the opposite. Since the variation within the group is larger than the variation between the groups it is difficult to distinguish any differences between the groups.

In statistics, the p-value is often of primary interest. To obtain a p-value for the ANOVA, the F-distribution is used to look up the p-value that corresponds to the F-ratio and the degrees of freedom connected to the within-group and between-group sums of squares. A p-value below 5% leads to rejection of the null hypothesis of no difference in means between the population groups. A p-value above 5% means that there is not enough evidence to reject the null hypothesis that the population group means are equal.
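To make the decomposition concrete, the following R sketch (not taken from the thesis; the two-group setup, seed and variable names are illustrative assumptions) computes SSB, SSW, the mean squares and the F-ratio by hand and cross-checks them against R's built-in ANOVA table.

```r
# Minimal sketch: the ANOVA decomposition by hand for two groups.
set.seed(1)
x <- rnorm(30)                      # group 1, standard normal
y <- rnorm(30)                      # group 2, standard normal
values <- c(x, y)
groups <- factor(rep(c("A", "B"), each = 30))

grand_mean  <- mean(values)
group_means <- tapply(values, groups, mean)
n_j         <- tapply(values, groups, length)

SSB <- sum(n_j * (group_means - grand_mean)^2)   # between-group sum of squares
SSW <- sum((values - group_means[groups])^2)     # within-group sum of squares
k   <- nlevels(groups); n <- length(values)

MSB <- SSB / (k - 1)
MSW <- SSW / (n - k)
F_ratio <- MSB / MSW
p_value <- pf(F_ratio, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Cross-check against R's ANOVA table
anova(lm(values ~ groups))
```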

2.2.2 Student’s t-test

For two groups, the ANOVA comparison reduces to Student’s t-test. The test statistic is calculated as

t^* = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where s_p represents the pooled standard deviation, which for equally sized groups is computed from the two group variances:

s_p = \sqrt{\frac{s_{x_1}^2 + s_{x_2}^2}{2}}

The t-statistic is then checked against the t-distribution to retrieve the p-value for the test. If a low p-value is obtained, the researcher can reject the null hypothesis of no difference in means in favor of the alternative hypothesis that the population group means differ.
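As a minimal illustration, the following R sketch (assumed code, not the thesis script) computes the pooled statistic above for two equally sized groups and cross-checks it against t.test with the equal-variance option.

```r
# Minimal sketch: Student's t for two equally sized groups, using the
# pooled standard deviation from the formulas in this section.
set.seed(2)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)

s_p  <- sqrt((var(x1) + var(x2)) / 2)                 # pooled SD, equal n
t_st <- (mean(x1) - mean(x2)) / (s_p * sqrt(1/n + 1/n))
p    <- 2 * pt(abs(t_st), df = 2 * n - 2, lower.tail = FALSE)

# Cross-check against the built-in test under the equal-variance assumption
t.test(x1, x2, var.equal = TRUE)
```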

2.2.3 Welch’s t-test

Adding outliers will affect the variances. Equal variances is an assumption of one-way ANOVA and Student’s t-test. Jan & Shieh (2014) state that research has shown that the F-test is not robust to heterogeneous populations, and that actual significance levels can be distorted even when group sizes are equal.

To control for this, one option is to utilize the Welch–Satterthwaite equation through Welch’s t-test. It is a common view among researchers that Welch’s test is the most appropriate method when there is heterogeneity in the data (Moser & Stevens, 1992). Jan & Shieh (2014) state that Welch’s t-test maintains control over type I error rates when variances are unequal. The test adjusts the degrees of freedom so that they correspond to the unpooled variances (Welch, 1938). The outcome is a more reliable statistic when the variances are unequal. Welch’s t-statistic is arrived at through

t^* = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

and the approximate degrees of freedom are calculated as

d.f. = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}}
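A small R sketch of the two formulas above (assumed code; the sample sizes, seed and deliberately unequal variances are illustrative), checked against R's t.test, which applies Welch's test by default:

```r
# Minimal sketch: Welch's t-statistic and the Welch-Satterthwaite
# degrees of freedom computed by hand.
set.seed(3)
x1 <- rnorm(30); x2 <- rnorm(30, sd = 2)   # unequal variances on purpose

v1 <- var(x1) / length(x1)
v2 <- var(x2) / length(x2)
t_w  <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)
df_w <- (v1 + v2)^2 / (v1^2 / (length(x1) - 1) + v2^2 / (length(x2) - 1))
p_w  <- 2 * pt(abs(t_w), df = df_w, lower.tail = FALSE)

t.test(x1, x2)   # Welch's test is R's default (var.equal = FALSE)
```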

Agresti & Finlay (2014) argue that outliers create a case where the t-test is not suitable, because outliers impact the mean and make it a poor representative of the center of the distribution (p. 122). However, Agresti & Finlay (2014) do not present any empirical evidence in connection with that statement. Seely et al. (2003) argue that a single outlier added to their dataset would inflate the estimate of variation and also bias the arithmetic mean towards the outlier value. This is used as motivation for using an alternative test statistic, but little is mentioned about the possible variation in severity depending on the magnitude of the outlier. Delacre et al. (2017) argue that Welch’s t-test should always be applied instead of Student’s t because of its enhanced robustness.

3. Simulation study

For the purpose of this study, a simulation approach is suitable. The choice is supported by previous studies using simulations when analyzing ANOVA. Delacre et al. (2017) perform a simulation study in which they investigate the difference in type I error rate between Student’s t-test and Welch’s t-test. Halldestam (2016) also performs a simulation study, in which he adds outliers to sample groups and analyzes the type I error rate.

Delacre et al. (2017) show that when variances are unequal but the group sizes are equal, Student’s t-test manages to maintain good control over the type I error rate. However, Delacre et al. (2017) argue that overall, Welch’s t-test provides more stable type I error rates than Student’s t-test. As discussed earlier, the addition of outliers impacts the equal variance assumption. However, deviant observations also affect the means of the sample groups. This aspect is not addressed in the Delacre et al. (2017) study, since they focus exclusively on unequal variances.

Halldestam (2016) introduces extreme outliers in the groups, which distorts the variance homogeneity assumption of the one-way ANOVA. However, his design is balanced, as there is an equal number of observations in each of the three groups. So while Delacre et al. (2017) investigate differences between Student’s t and Welch’s t due to unequal variances, and Halldestam (2016) analyzes outliers, there is a possibility to expand the results of both of these studies by jointly investigating outliers of varying magnitude and how well the two t-statistics can deal with the outlier effects on sample group means and variances.

In the first step of the simulation, a standard normal population is generated using R. From this population, a sampling loop is set up. The process starts with the drawing of a sample consisting of two groups from the population. One or more outliers are then added to one of the groups. Student’s t-test and Welch’s t-test are then performed on the drawn sample, and the p-values of these tests are stored; one p-value for each test is generated in every replication. Both Ramsey et al. (2011) and Halldestam (2016) perform 10,000 replications, which is arguably an appropriate trade-off between time requirement and generalizability.

The significance level for the tests is 0,05, which should translate to an average null rejection rate of 0,05 if outliers have no effect on the results. This is because the two population groups are generated from a normal distribution with mean zero and variance 1. The 10,000 stored p-values for each of the two t-tests are then analyzed in relation to the significance level. Calculating the proportion of p-values below the significance level yields the type I error rate: the percentage of tests rejecting the null hypothesis of no difference between the group means when there is in fact no difference.
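The loop described above might look as follows in R. This is a minimal sketch under stated assumptions (small sample setting with 30 observations per group, one outlier of magnitude 3,4 replacing one drawn observation so that group sizes stay balanced), not the thesis's actual script.

```r
# Minimal sketch of the simulation loop described in the text.
set.seed(4)
R <- 10000; n <- 30; outlier <- 3.4
p_student <- p_welch <- numeric(R)

for (r in 1:R) {
  g1 <- c(rnorm(n - 1), outlier)   # group 1: 29 draws plus one outlier
  g2 <- rnorm(n)                   # group 2: 30 draws, no outlier
  p_student[r] <- t.test(g1, g2, var.equal = TRUE)$p.value
  p_welch[r]   <- t.test(g1, g2)$p.value
}

# Type I error rate: the proportion of tests rejecting at the 5% level
mean(p_student < 0.05)
mean(p_welch   < 0.05)
```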

Depending on the resulting type I error rate, conclusions can be drawn about the effect of outliers. It is therefore relevant to provide a threshold for what value can be interpreted as significantly deviant from the 0,05 significance level. This can be done using a confidence interval (CI) (Ramsey et al., 2011).

Due to the design of the study, with varying outlier values and sample size settings, a large number of type I error rates are generated and analyzed. As a consequence of the large number of comparisons between the 0,05 significance level and the simulated rejection rates, the risk of incorrectly declaring that a simulated value differs from 0,05 is inflated. To at least partly address this inflation, a 99% confidence interval is established. This interval is wider than a 95% CI, so the simulated type I error rates need to be further away from the 0,05 significance level in order to be considered significantly deviant.

ME = 2.5758 \cdot \sigma_{\hat{p}} = 2.5758 \cdot \sqrt{\frac{0.05 \cdot 0.95}{10\,000}} \approx 0.0056

99% confidence interval: [0,0444 ; 0,0556]

This means that type I error rates below 4,44% or above 5,56% will be considered as influenced by the outliers added to the samples. Ramsey et al. (2011) also refer to Bradley’s (1978) liberal criterion of robustness, which states that no statistic can be considered robust if the type I error rate is outside the interval 0,5α – 1,5α, that is, outside [0,025 ; 0,075]. This criterion is used in addition to the 99% CI.
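The margin of error can be verified with a short R calculation (a worked check; the names alpha and R are mine):

```r
# Worked check: margin of error for a simulated rejection proportion
# around alpha = 0.05 with 10,000 replications.
alpha <- 0.05; R <- 10000
se <- sqrt(alpha * (1 - alpha) / R)         # sd of the estimated proportion
me <- qnorm(0.995) * se                     # 99% two-sided margin of error
c(lower = alpha - me, upper = alpha + me)   # approx. [0.0444, 0.0556]
```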

3.1 Sample size

An important consideration for the study is the sample size. If the sample size is small, researchers need to be particularly cautious about outliers, as they can affect the validity of the mean as a measure of the center (Agresti & Finlay, 2014, p. 128). Large sample sizes are especially important when extreme outliers are present, as averaging effects can mitigate the outlier impact (Boneau, 1960). It is therefore arguably relevant to perform the analysis in different sample size settings. The sample sizes used for the simulations are determined using the technique presented by Cohen (1988).

n_{per\,group} = 16 \left(\frac{\sigma}{d}\right)^2

where the effect size is

d = 0.2\sigma \text{ (small)}, \quad d = 0.5\sigma \text{ (medium)}, \quad d = 0.8\sigma \text{ (large)}

which yields three different per-group sample sizes:

16 \left(\frac{\sigma}{0.2\sigma}\right)^2 = 400, \quad 16 \left(\frac{\sigma}{0.5\sigma}\right)^2 = 64, \quad 16 \left(\frac{\sigma}{0.8\sigma}\right)^2 = 25

The calculated sample size of 25 is rounded up to 30 to make use of the Central Limit Theorem. Thus, three different per-group sample sizes are simulated and analyzed: 30, 64 and 400.
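As a quick check of the arithmetic, Cohen's formula can be evaluated in R (a one-line sketch; sigma cancels because the effect sizes are expressed in standard deviation units):

```r
# Cohen's (1988) per-group n = 16 * (sigma / d)^2 with standardized d
d <- c(large = 0.8, medium = 0.5, small = 0.2)
ceiling(16 / d^2)   # 25, 64, 400 per group (25 is then rounded up to 30)
```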

3.2 Outlier values

Both Ramsey et al. (2011) and Halldestam (2016) use fixed-value outliers. Ramsey et al. (2011) set outlier values to µ + 5σ, while Halldestam (2016) uses outliers with values ±10, arrived at by doubling the x-values of the outer fences (i.e. 3 IR from the box). Referring back to the Tukey (1977) classification, these outliers are categorized as far out, and they can even be considered extreme outliers. Since the aim of the current study is to investigate how outliers of different magnitudes influence the two t-tests of Student and Welch, using a single outlier value is not appropriate.

In order to determine the values of the multiple outliers used in this study, the boxplot presented by Tukey (1977) is utilized. In the generated population, the first and third quartiles are -0,674 and 0,674 respectively, which makes an IR of 1,348. The IR is then used to amplify outlier values in a stepwise fashion. The calculated outlier values are presented in Table 1 below.

Table 1 – Outlier values
Due to symmetry, the calculations for the negative outlier values are not shown.

Outlier (i)    Q3 + 2 IR    0,674 + 2 · 1,348 ≈ 3,4
Outlier (ii)   Q3 + 3 IR    0,674 + 3 · 1,348 ≈ 4,7
Outlier (iii)  Q3 + 4 IR    0,674 + 4 · 1,348 ≈ 6,1
Outlier (iv)   Q3 + 5 IR    0,674 + 5 · 1,348 ≈ 7,4
Outlier (v)    Q3 + 6 IR    0,674 + 6 · 1,348 ≈ 8,8
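The values in Table 1 can be reproduced in R from the standard normal quartiles (a minimal sketch):

```r
# Deriving the outlier values in Table 1 from Tukey's (1977) fences
q3  <- qnorm(0.75)          # third quartile of N(0,1), approx. 0.674
iqr <- 2 * q3               # hinge-spread (IR), approx. 1.348
round(q3 + (2:6) * iqr, 1)  # outliers (i)-(v): 3.4, 4.7, 6.1, 7.4, 8.8
```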

3.3 Simulation setup

In this section the structure of the simulations is presented, showing how the outliers calculated in the previous section are added to the samples.

One possible strategy is the one used by Ramsey et al. (2011). They run through each of the observations in their dataset, and at each data point there is a 5% chance of it being replaced by an outlier. Doing this for all of the observations yields a dataset where approximately 1 out of 20 observations is an outlier. A similar approach would be possible for the current study as well. However, the loss of control due to random inclusion of outliers is not suitable here. Halldestam (2016) instead establishes a pre-specified inclusion scheme and adds outliers to the three groups in his one-way ANOVA accordingly.

Adopting the approach of Halldestam (2016) enables more control over the different outlier scenarios. In addition, it makes it easier to compare the results for different outlier values, since there are no differences in how they are added to the groups. The situations in Table 2 illustrate how outliers are added to the samples. For each scenario, the sample drawn leaves out as many observations as are to be added as outliers. For example, the total sample drawn for Situation 2A contains 59 observations divided between the two groups; the outlier is then added so that the group sizes are balanced.

Table 2 – Outlier inclusion scheme

             A                                               B
Situation 1  No outliers (reference group)
Situation 2  One positive (i) outlier in one of the groups   Two positive (i) outliers in one of the groups
Situation 3  One positive (ii) outlier in one of the groups  Two positive (ii) outliers in one of the groups
Situation 4  One positive (iii) outlier in one of the groups Two positive (iii) outliers in one of the groups
Situation 5  One positive (iv) outlier in one of the groups  Two positive (iv) outliers in one of the groups
Situation 6  One positive (v) outlier in one of the groups   Two positive (v) outliers in one of the groups

Halldestam (2016) finds that one extreme outlier added to one of the three groups in the ANOVA setting yields a type I error rate below 5%. This is described as a possible special case. Zimmerman (1994) found that the probability of a type I error declined to 3% for simulated two-sample t-tests; he argued that this was due to both the extremity of the outlier and the probability of an outlier being included in the sample. These findings motivate analyzing what effect one outlier has on the type I error rate (column A in Table 2), as it will impact both the mean and the variance. Halldestam (2016) further shows that the type I error rate increases with the addition of multiple extreme outliers in his one-way ANOVA setting. These results motivate the choice of including two outliers in one of the sample groups (column B in Table 2).

4. Results

In this section the results of the simulations are presented. The presentation will be divided by the three alternative sample sizes.

4.1 Small sample size

The small sample size contains a total of 60 observations divided between the two groups. The tables report the fraction of tests rejecting the null hypothesis of no difference between the population groups. The generated population is normally distributed and there is no actual difference between the population groups.

Table 3 – One outlier – Small sample
Type I error rate when one outlier is present.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0550       0,0471       0,0456        0,0415         0,0363        0,0316
Student’s t  0,0546       0,0478       0,0472        0,0429         0,0392        0,0340

Table 3 shows that for the small sample size, the type I error rate seems to decrease with the addition of one outlier. The reference group with no outliers in the sample yields a fraction of around 5,5%, which is right at the upper limit of the confidence interval. This is somewhat surprising, as it should be close to 5%, but it is interpreted as due to chance. The two mildest outliers show type I error rates close to 5%, suggesting that there is no significant influence of a single mild outlier even when the sample size is small. However, the three more severe outliers show that the fraction of tests rejected actually decreases when a single outlier is present. The type I error rates for these outliers are beyond the lower limit of 4,44% of the 99% CI. This means that the test actually has a decreased risk of making a type I error, which is in line with the results of the Halldestam (2016) study.

Table 4 – Two outliers – Small sample
Type I error rate when two outliers are present.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  2 Outlier (i)  2 Outlier (ii)  2 Outlier (iii)  2 Outlier (iv)  2 Outlier (v)
Welch’s t    0,0502       0,0809*        0,0888*         0,0898*          0,0841*         0,0769*
Student’s t  0,0500       0,0805*        0,0904*         0,0926*          0,0885*         0,0819*

Table 4 shows the results when two outliers, instead of one, are added to one of the two groups. The no-outlier situation yields a type I error rate close to 5%, as expected. The fractions when outliers are present show that the type I error rate deteriorates in the small sample size setting, as all are above the 5,56% upper CI limit. All type I error rates are even beyond the upper limit of Bradley’s (1978) robustness criterion of 0,075. Therefore, neither of the tests can be considered robust when more than one outlier is present in one of the groups and sample sizes are small.

The highest type I error rate belongs to the situation with two outlier (iii) values: the fraction for Student’s t is 9,26% and Welch’s t is marginally below 9%. In practice this means a significantly increased risk of rejecting the null hypothesis even though there is no difference in the population. It further seems that the fraction first increases with the distance from the third quartile, but only up to 4 IR. Outliers set at 5 and 6 IR from the “box” (outlier (iv) and (v)) then seem to decrease the type I error rate again, although not to an extent that would suggest robustness.

A possible explanation is that a second outlier, together with the first outlier, can alter the mean and lead to larger t-statistics from the tests, making it easier to find significant differences between the groups.

Comparing the two test statistics shows that Welch’s t-test again yields lower type I error rates when two outliers are present in the small sample size setting. Overall this suggests that when the sample size is small and one or two outliers are present in the data, Welch’s t-test should be the preferred choice. This is consistent with the results presented by Delacre et al. (2017).

4.2 Medium sample size

The medium sample size contains a total of 128 observations divided between the two groups.

Table 5 – One outlier – Medium sample
Type I error rate when one outlier is present.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0516       0,0480       0,0482        0,0478         0,0468        0,0441
Student’s t  0,0516       0,0475       0,0479        0,0477         0,0467        0,0437

Table 5 shows that the reference group with no outliers yields an expected fraction close to the 5% significance level. Similar to the small sample case, type I error rates are below 5% for all outlier values, and the fraction of rejected tests declines as the magnitude of the outlier increases. The tendency is not as distinct as in the small sample setting, however, which is reflected in the fact that only outlier (v) shows a type I error rate lower than 4,44%.

Table 6 – Two outliers – Medium sample
Type I error rate when two outliers are present.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  2 Outlier (i)  2 Outlier (ii)  2 Outlier (iii)  2 Outlier (iv)  2 Outlier (v)
Welch’s t    0,0503       0,0678         0,0778*         0,0843*          0,0874*         0,0875*
Student’s t  0,0503       0,0680         0,0779*         0,0851*          0,0892*         0,0910*

Adding a second outlier to the medium sample of 128 observations leads to an increased type I error rate. Table 6 illustrates that even two mild outliers added to one group increase the type I error rate to around 7%, which is above the upper limit of the 99% confidence interval. The fraction increases as the magnitude of the outlier increases; the highest type I error rate in the medium sample setting is approximately 9%, generated by two outlier (v) values at 6 IR from the “box”.

Comparison of the type I error rates between the two test statistics does not show any apparent differences, although Welch’s t yields a marginally lower fraction in each situation. Overall, the medium sample size shows results similar to the smaller sample size. The same phenomenon of decreasing type I error rates when a single outlier is added is observed, and the addition of a second identical outlier in each situation increases the type I error rate towards 9%, similar to the small sample setting. However, while there arguably were distinguishable differences between the two test statistics in the small sample setting, the medium sample shows no distinct differences.

4.3 Large sample size

The large sample size contains 800 observations divided between the two groups. Again, the results presented in Tables 7 and 8 are the fractions of tests rejecting the null hypothesis of no difference between the population groups, out of the 10,000 replications performed.

Table 7 – One outlier – Large sample
Type I error rate when one outlier is present.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0504       0,0473       0,0486        0,0492         0,0488        0,0492


Table 7 indicates that adding a single outlier to one of the groups when the sample size is large does not affect the type I error rate. The mildest outlier yields a fraction of 4,7% and the most extreme, located 6 IR from the “box”, is just below 5%. All of the fractions are within the limits of the 99% confidence interval. The presence of a single outlier, as long as it is not a severely extreme value, will thus not influence the type I error rate when the sample size is around 400 per group.

The tendency of diminishing differences between the two test statistics observed in the medium sample setting continues in the large sample size setting. As illustrated in Table 7, the type I error rates of Welch’s t and Student’s t are almost identical, and both maintain control over the fraction of rejections. It seems as if the advantage of Welch’s t-test is most apparent when sample sizes are smaller.

Table 8 – Two outliers – Large sample
Type I error rate when two outliers are present.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  2 Outlier (i)  2 Outlier (ii)  2 Outlier (iii)  2 Outlier (iv)  2 Outlier (v)
Welch’s t    0,0521       0,0496         0,0556          0,0608           0,0655          0,0693
Student’s t  0,0518       0,0496         0,0556          0,0604           0,0655          0,0698

The effect of adding two outliers of each magnitude to one of the two groups is illustrated in Table 8. There is no clear difference between the reference group and the addition of two of the mildest outlier (i), as the type I error rates are 5,21% and 4,96% respectively. For the other outlier magnitudes, the type I error rate increases with the magnitude of the outliers, with the more extreme values pushing the rate beyond the upper CI limit, although all rates remain within Bradley’s (1978) more liberal criterion.

5. Conclusion

The aim of this study was to investigate how outliers of different magnitude influence the type I error rate of the two-sample t-test. For this purpose, the two alternative test statistics, Student’s t-test and Welch’s t-test, were simulated and analyzed.

The results of the simulations show that there are differences in how outliers of different magnitude impact the type I error rate. The effect of one outlier added to one of the two sample groups depends on the sample size. When samples are small, the type I error rate deviates significantly for the three most extreme outliers. The medium sample resists the impact of outliers better, but the most extreme outlier still causes a significantly deviant type I error rate. The large sample size setting absorbs the effect of a single outlier, and there are no significant deviations.

When faced with a single outlier in the data, researchers must therefore consider both the magnitude of the outlier and the sample size. If the sample consists of a large number of observations, the evidence in this study shows that both Student’s t-test and Welch’s t-test are robust. However, when sample sizes are smaller, a single outlier can impact the average probability of incorrectly rejecting the null hypothesis.

The influence of two outliers added to one of the two sample groups provides more distinct evidence. The type I error rate deviates significantly for every magnitude and sample size, with one exception: adding two of the mildest outliers does not cause a significantly deviating type I error rate in the large sample size setting. Although the effects are mitigated by increasing the sample size, the results show an unequivocal sensitivity towards two outliers in the same sample group. Researchers must therefore tread with caution when faced with multiple outliers in samples of data. Support for this conclusion was also found in three additional simulations in which the outliers were added to the groups in alternative ways; these are presented in Appendix A.

Throughout the simulations performed, Welch’s t-test yields more robust type I error rates than Student’s test. The comparison of the two alternative statistics suggests that Welch’s t-test should be the preferred choice, as it shows superior control of type I error rates.


Investigating statistical power in the presence of outliers is a possible topic for future research. Another suggestion is to investigate how outliers impact data generated from distributions other than the standard normal used in this study.

References

Agresti, A. & Finlay, B. (2014). Statistical Methods for the Social Sciences. Pearson Education Limited: Essex.

Barnett, V. & Lewis, T. (1984). Outliers in Statistical Data, Second Edition. John Wiley & Sons.

Boneau, C.A. (1960). “The Effects of Violations of Assumptions Underlying the t Test”. Psychological Bulletin, vol. 57(1), pp. 49-64.

Bradley, J.V. (1978). “Robustness?”. British Journal of Mathematical and Statistical Psychology, vol. 31(2), pp. 144-152.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences, Second Edition. Academic Press: New York.

Delacre, M., Lakens, D. & Leys, C. (2017). “Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test”. International Review of Social Psychology, vol. 30(1), pp. 92-101.

Halldestam, M. (2016). “ANOVA – The Effect of Outliers”. Bachelor’s thesis, Department of Statistics, Uppsala University.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J. & Stahel, W.A. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley: New York.

Hodge, V. & Austin, J. (2004). “A Survey of Outlier Detection Methodologies”. Artificial Intelligence Review, vol. 22(2), pp. 85-126.

Jan, S-L. & Shieh, G. (2014). “Sample Size Determinations for Welch’s Test in One-Way Heteroscedastic ANOVA”. British Journal of Mathematical and Statistical Psychology, vol. 67, pp. 72-93.

Moser, B.K. & Stevens, G.R. (1992). “Homogeneity of Variance in the Two-Sample Means Test”. The American Statistician, vol. 46(1), pp. 19-21.

Ramsey, P.H., Barrera, K., Hachimine-Semprebom, P. & Liu, C-C. (2011). “Pairwise Comparisons of Means Under Realistic Nonnormality, Unequal Variances, Outliers and Equal Sample Sizes”. Journal of Statistical Computation and Simulation, vol. 81(2), pp. 125-135.

Scariano, S.M. & Davenport, J.M. (1987). “The Effects of Violations of Independence Assumptions in the One-Way ANOVA”. The American Statistician, vol. 41(2), pp. 123-129.

Seely, R.J., Munyakazi, L., Haury, J. & Simmerman, H. (2003). “Application of the Weisberg t-Test for Outliers”. Pharmaceutical Technology Europe, vol. 15(10), p. 37.

Sheskin, D.J. (2000). Handbook of Parametric and Nonparametric Statistical Procedures, 2nd edn. Chapman & Hall/CRC Press: Boca Raton.

Stigler, S.M. (1977). “Do Robust Estimators Work With Real Data?”. The Annals of Statistics, vol. 5(6), pp. 1055-1098.

Stock, J.H. & Watson, M.W. (2015). Introduction to Econometrics, 3rd rev. Global edn. Pearson Education: Harlow.

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley: Reading, Mass.

Welch, B.L. (1938). “The Significance of the Difference Between Two Means When the Population Variances Are Unequal”. Biometrika, vol. 29(3/4), pp. 350-362.

Zimmerman, D.W. (1994). “A Note on the Influence of Outliers on Parametric and Nonparametric Tests”. The Journal of General Psychology, vol. 121(4), pp. 391-396.

Appendix A. Additional Simulations

In this appendix, additional simulations are presented. These are included to illustrate alternative situations to the ones chosen and analyzed more thoroughly in the study. The results from these additional situations are presented below.

Alternative 1 – Including one positive outlier in Group A and one negative outlier of the same magnitude in Group B.

Tables 9-11 show that both t-tests are sensitive to this type of outlier presence. Type I error rates are significantly deviant for all but the two mildest magnitudes in the large sample. The statistics are seemingly not robust when a positive and a negative outlier are included.

Table 9 – 2 outliers – Small sample
Type I error rate. One positive outlier in one group and one negative in the other.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0490       0,0780*      0,0830*       0,0826*        0,0761*       0,0655
Student’s t  0,0495       0,0798*      0,0870*       0,0873*        0,0827*       0,0724

Table 10 – 2 outliers – Medium sample
Type I error rate. One positive outlier in one group and one negative in the other.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0523       0,0647       0,0746        0,0824*        0,0869*       0,0854*
Student’s t  0,0526       0,0646       0,0755*       0,0833*        0,0857*       0,0877*

Table 11 – 2 outliers – Large sample
Type I error rate. One positive outlier in one group and one negative in the other.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0517       0,0511       0,0529        0,0599         0,0600        0,0645

Alternative 2 – Including one positive outlier and one negative outlier of the same magnitude in the same group.

When the positive and negative outliers are included in the same group, Tables 12-14 show that type I error rates are significantly deviant, falling below the 5% level. This is not surprising, since the variance is inflated while the effect on the mean may be offset by the positive and negative outlier values. As in Alternative 1, the statistics are seemingly not robust when both positive and negative outliers are included.

Table 12 – 2 outliers – Small sample
Type I error rate. One positive and one negative outlier in the same group.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0507       0,0201*      0,0091*       0,0035*        0,0008*       0,0002*
Student’s t  0,0505       0,0189*      0,0084*       0,0023*        0,0006*       0,0001*

Table 13 – 2 outliers – Medium sample
Type I error rate. One positive and one negative outlier in the same group.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0498       0,0305       0,0206*       0,0101*        0,0057*       0,0030*
Student’s t  0,0496       0,0301       0,0205*       0,0099*        0,0057*       0,0021*

Table 14 – 2 outliers – Large sample
Type I error rate. One positive and one negative outlier in the same group.
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0526       0,0454       0,0423        0,0385         0,0352        0,0312

Alternative 3 – Including one positive outlier of the same magnitude in each of the two groups.

Tables 15-17 show that the two tests are not robust in this scenario either; type I error rates are significantly below 5%. The large sample setting seems to mitigate some of the effects.

Table 15 – 2 outliers – Small sample
Type I error rate. One outlier in each group (both positive).
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0514       0,0193*      0,0071*       0,0026*        0,0006*       0,0002*
Student’s t  0,0502       0,0202*      0,0079*       0,0025*        0,0007*       0,0002*

Table 16 – 2 outliers – Medium sample
Type I error rate. One outlier in each group (both positive).
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0481       0,0314       0,0233*       0,0129*        0,0077*       0,0038*
Student’s t  0,0483       0,0314       0,0238*       0,0134*        0,0080*       0,0039*

Table 17 – 2 outliers – Large sample
Type I error rate. One outlier in each group (both positive).
CI limits are [0,0444 ; 0,0556] and values beyond them are marked in bold. Values beyond Bradley’s (1978) criterion [0,025 ; 0,075] are marked with an asterisk.

             No outliers  Outlier (i)  Outlier (ii)  Outlier (iii)  Outlier (iv)  Outlier (v)
Welch’s t    0,0457       0,0443       0,0420        0,0389         0,0361        0,0321
Student’s t  0,0458       0,0444       0,0420        0,0390         0,0360        0,0321
