Multiple comparisons
In statistics, the multiple comparisons problem occurs when one considers a set, or family, of statistical inferences simultaneously. Errors in inference, including confidence intervals that fail to include their corresponding population parameters, or hypothesis tests that incorrectly reject the null hypothesis, are more likely when one considers the family as a whole.
The term "comparisons" in multiple comparisons typically refers to comparisons of two groups, such as treatment versus control. "Multiple comparisons" enters when there are several such comparisons; e.g., when comparing treatment versus control separately for each measurement that was made on the subjects. If there are 10 such measurements, then the number of comparisons (the size of the "family") is 10.
The family of statistical inferences can consist of confidence intervals, hypothesis tests, or a combination of the two.
To illustrate the issue in terms of confidence intervals, note that a single confidence interval at the 95% level will likely contain the population parameter it is meant to contain. However, if one considers 100 confidence intervals simultaneously, each with 95% coverage probability, it is highly likely that at least one interval will not contain its population parameter. In fact, the expected number of such intervals is exactly 5.0.
If the inferences are hypothesis tests rather than confidence intervals, the same issue arises. With just one test, performed at the usual 5% level, there is only a 5% chance of incorrectly rejecting the null hypothesis. However, with 100 tests where all null hypotheses are true, it is highly likely that at least one null hypothesis will be rejected incorrectly. In fact, the expected number of such rejections is exactly 5.0, as in the confidence interval case. These errors are called false positives. Many mathematical techniques have been developed to control the false positive error rate associated with making multiple statistical comparisons.
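To make the arithmetic concrete, the following Python sketch (illustrative only; the simulation settings are assumptions chosen for this example) simulates 10,000 families of 100 tests in which every null hypothesis is true. Under a true null hypothesis a p-value is uniformly distributed, so the p-values can be simulated directly.

    import numpy as np

    rng = np.random.default_rng(0)
    n_tests = 100      # size of the family
    alpha = 0.05       # per-test significance level
    n_sims = 10_000    # number of simulated families

    # under a true null hypothesis, the p-value is uniform on [0, 1]
    p_values = rng.uniform(size=(n_sims, n_tests))

    false_positives = (p_values < alpha).sum(axis=1)
    print("average false positives per family:", false_positives.mean())    # about 5.0
    print("P(at least one false positive):", (false_positives > 0).mean())  # about 0.994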
Not compensating for multiple comparisons can have important real-world consequences; for instance, when the multiple comparisons involve drug efficacy, it may result in approval of a drug as an improvement over existing drugs when it is in fact equivalent to them. On the other hand, when the multiple comparisons involve drug safety, it could easily happen by chance that the new drug appears to be worse for some side effect when it is actually no worse for this side effect.
Flipping coins
For example, one might declare that a coin was biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is 11 × (1/2)^10 ≈ 0.0107. This is relatively unlikely, and under statistical criteria such as p-value < 0.05, one would reject the null hypothesis and conclude that the coin is unfair.
A multiple comparisons problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Imagine testing 100 fair coins by this method. Given that the probability of a fair coin coming up heads 9 or 10 times in 10 flips is 0.0107, one would expect that, in flipping 100 fair coins ten times each, seeing any particular (i.e. pre-selected) coin come up heads 9 or 10 times would still be very unlikely, but seeing at least one coin, no matter which, behave that way would be more likely than not. Precisely, the probability that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would, more likely than not, falsely identify at least one fair coin as unfair.
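A short calculation (an illustrative sketch) reproduces these numbers:

    from math import comb

    # probability that a single fair coin shows at least 9 heads in 10 flips
    p_single = (comb(10, 9) + comb(10, 10)) * 0.5 ** 10
    print(p_single)            # 11/1024, about 0.0107

    # probability that all 100 fair coins pass, and that at least one is flagged
    p_all_pass = (1 - p_single) ** 100
    print(p_all_pass)          # about 0.34
    print(1 - p_all_pass)      # about 0.66: a false "unfair" verdict is more likely than not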
Formalism
Technically, the problem of multiple comparisons (also known as the multiple testing problem) can be described as the potential increase in Type I error that occurs when statistical tests are used repeatedly: if n independent comparisons are performed, each at per-comparison significance level α_per-comparison, the experiment-wide significance level ᾱ is given by

ᾱ = 1 − (1 − α_per-comparison)^n,

and it increases as the number of comparisons increases. If we do not assume that the comparisons are independent, then by Boole's inequality we can still say

ᾱ ≤ n × α_per-comparison.
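For example, with a per-comparison level of 0.05 the experiment-wide level grows quickly with n; the short Python sketch below (illustrative only, with hypothetical values of n) computes both the exact independent-case level and the distribution-free upper bound.

    alpha_per_comparison = 0.05

    for n in (1, 5, 10, 50, 100):
        # exact experiment-wide significance level if the n comparisons are independent
        alpha_independent = 1 - (1 - alpha_per_comparison) ** n
        # Boole/Bonferroni upper bound, valid under any dependence structure
        alpha_bound = min(1.0, n * alpha_per_comparison)
        print(n, round(alpha_independent, 3), alpha_bound)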
Methods
In order to retain the same overall rate of false positives (rather than a higher rate) in a test involving more than one comparison, the standards for each comparison must be more stringent. Intuitively, dividing the allowable error rate (alpha) for each comparison by the number of comparisons results in an overall alpha that does not exceed the desired limit, and this can be proved mathematically using Bonferroni's inequality, regardless of independence or dependence among the test statistics.
However, it can be demonstrated that this technique (called the Bonferroni method) is overly conservative, i.e., it will actually result in a true alpha substantially smaller than 0.05 when the test statistics are highly dependent and/or when many of the null hypotheses are false, and it therefore fails to identify an unnecessarily high percentage of the true differences. For example, in fMRI analysis, tests are performed on over 100,000 voxels in the brain. The Bonferroni method would require p-values smaller than 0.05/100,000 to declare significance; this threshold may be too stringent for practical use.
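As a minimal sketch of the Bonferroni method described above (the p-values below are hypothetical and chosen only for illustration):

    import numpy as np

    p_values = np.array([0.001, 0.008, 0.020, 0.041, 0.30])  # hypothetical test results
    alpha = 0.05
    m = len(p_values)

    # Bonferroni: compare each p-value to alpha / m
    # (equivalently, multiply each p-value by m and compare to alpha)
    reject = p_values < alpha / m
    print("per-comparison threshold:", alpha / m)  # 0.01
    print("rejected:", reject)                     # only the first two tests survive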
Because simple techniques such as the Bonferroni method can be too conservative, there has been a great deal of attention paid to developing better techniques, such that the overall rate of false positives can be maintained without inflating the rate of false negatives unnecessarily. Such methods can be divided into general categories:
- Methods where total alpha can be proved to never exceed 0.05 (or some other chosen value) under any conditions. These methods provide "strong" control against Type I error, in all conditions including a partially correct null hypothesis.
- Methods where total alpha can be proved not to exceed 0.05 except under certain defined conditions.
- Methods which rely on an omnibus test before proceeding to multiple comparisons. Typically these methods require a significant ANOVA/Tukey range test before proceeding to multiple comparisons. These methods have "weak" control of Type I error.
- Empirical methods, which control the proportion of Type I errors adaptively, utilizing correlation and distribution characteristics of the observed data.
The advent of computerized resampling methods, such as bootstrapping and Monte Carlo simulations, has given rise to many techniques in the latter category. In some cases where exhaustive permutation resampling is performed, these tests provide exact, strong control of Type I error rates; in other cases, such as bootstrap sampling, they provide only approximate control.
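The following sketch (simulated data with all null hypotheses true; the variable names and settings are assumptions made for illustration) shows the basic idea behind permutation-based control: permute the group labels and use the maximum statistic across all comparisons to obtain a familywise-adjusted p-value.

    import numpy as np

    rng = np.random.default_rng(1)
    n_per_group, n_features = 20, 50

    # simulated data: two groups, every null hypothesis true
    group_a = rng.normal(size=(n_per_group, n_features))
    group_b = rng.normal(size=(n_per_group, n_features))
    data = np.vstack([group_a, group_b])
    labels = np.array([0] * n_per_group + [1] * n_per_group)

    def max_abs_mean_diff(data, labels):
        # largest absolute difference in group means across all features
        diff = data[labels == 0].mean(axis=0) - data[labels == 1].mean(axis=0)
        return np.abs(diff).max()

    observed = max_abs_mean_diff(data, labels)

    # permutation distribution of the maximum statistic
    n_perm = 2000
    perm_max = np.empty(n_perm)
    for i in range(n_perm):
        perm_labels = rng.permutation(labels)
        perm_max[i] = max_abs_mean_diff(data, perm_labels)

    # familywise-adjusted p-value for the most extreme comparison
    p_adjusted = (perm_max >= observed).mean()
    print("adjusted p-value:", p_adjusted)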
Post-hoc testing of ANOVAs
Multiple comparison procedures are commonly used after obtaining a significant omnibus test, such as the ANOVA F-test. A significant ANOVA result suggests rejecting the global null hypothesis H0 that all group means are equal. Multiple comparison procedures are then used to determine which means differ from which others.
Comparing K means involves K(K − 1)/2 pairwise comparisons.
- The Nemenyi test is similar to the ANOVA Tukey test.
- The Bonferroni-Dunn test allows multiple comparisons while controlling the familywise error rate.
- Student-Newman-Keuls post-hoc ANOVA analysis
The Friedman test is the non-parametric alternative to ANOVA. Multiple comparisons can be done using pairwise comparisons (for example, Wilcoxon signed-rank tests) together with a correction to determine whether the post-hoc tests are significant (for example, a Bonferroni correction).
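As a rough sketch of this workflow with simulated data (using SciPy; the group names, sample sizes, and effect size are assumptions made for illustration), one can run the omnibus ANOVA F-test and then Bonferroni-corrected pairwise t-tests:

    from itertools import combinations

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    groups = {
        "A": rng.normal(0.0, 1.0, 30),
        "B": rng.normal(0.0, 1.0, 30),
        "C": rng.normal(1.0, 1.0, 30),   # shifted mean
    }

    # omnibus test: one-way ANOVA F-test
    f_stat, p_omnibus = stats.f_oneway(*groups.values())
    print("ANOVA p-value:", p_omnibus)

    # K(K-1)/2 pairwise comparisons with a Bonferroni-corrected threshold
    pairs = list(combinations(groups, 2))
    corrected_alpha = 0.05 / len(pairs)
    for name1, name2 in pairs:
        t_stat, p = stats.ttest_ind(groups[name1], groups[name2])
        print(name1, "vs", name2, "p =", round(p, 4), "significant:", p < corrected_alpha)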
Large-scale multiple testing
For large-scale multiple testing (for example, as is very common in genomics when using technologies such as DNA microarrays) one can instead control the false discovery rate (FDR), defined to be the expected proportion of false positives among all significant tests. One simple meta-test is to use a Poisson distribution whose mean is the expected number of significant tests, equal to α times the number of comparisons, to estimate the likelihood of finding any given number of significant tests.
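One standard way to control the FDR is the Benjamini-Hochberg step-up procedure; the sketch below (with hypothetical p-values, illustrative only) implements it directly rather than relying on any particular library.

    import numpy as np

    def benjamini_hochberg(p_values, q=0.05):
        """Return a boolean array indicating which hypotheses to reject at FDR level q."""
        p = np.asarray(p_values)
        m = len(p)
        order = np.argsort(p)
        # step-up rule: find the largest k with p_(k) <= (k / m) * q
        thresholds = (np.arange(1, m + 1) / m) * q
        below = p[order] <= thresholds
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()
            reject[order[: k + 1]] = True   # reject the k+1 smallest p-values
        return reject

    p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p_values, q=0.05))   # only the two smallest are rejected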
See also
- Key concepts
  - Comparisonwise error rate
  - Experimentwise error rate
  - Familywise error rate
  - False discovery rate (FDR)
  - Post-hoc analysis
- General methods of alpha adjustment for multiple comparisons
  - Closed testing procedure
  - Bonferroni correction
  - Boole-Bonferroni bound
  - Dunn-Šidák bound
  - Holm-Bonferroni method
  - Testing hypotheses suggested by the data
  - Westfall-Young step-down approach
- Single-step procedures
  - Tukey-Kramer method (Tukey's HSD) (1951)
  - Scheffé method (1953)
- Two-step procedures
  - Fisher's protected LSD (1935)
- Multi-step procedures based on the Studentized range statistic
  - Student-Newman-Keuls method (1939)
  - Tukey's B method (mid-1950s, probably 1953-54)
  - Duncan's new multiple range test (1955)
  - Ryan-Einot-Gabriel-Welsch method (1960s to mid-1970s)
- Bayesian methods
  - Duncan-Waller k-ratio t-test



