From: Statnotes: Topics in Multivariate Analysis, by G. David Garson, http://faculty.chass.ncsu.edu/garson/pa765/anova.htm (accessed 20 October 2010)

Planned multiple comparison t-tests, also just called "multiple comparison tests". In one-way ANOVA for confirmatory research, when difference of means tests are pre-planned and not just post-hoc, as when a researcher plans to compare each treatment group mean with the mean of the control group, one may apply a simple t-test, a Bonferroni-adjusted t-test, the Sidak test, or Dunnett's test. The last two are also variants of the t-test. The t-test is thus a test of significance of the difference in the means of a single interval dependent variable, for the case of two groups formed by a categorical independent variable. The difference between the planned multiple comparison tests discussed in this section and the post-hoc multiple comparison tests discussed in the next section is one of power, not purpose. Some, including SPSS, lump all the tests together as "post hoc tests", as illustrated below. This figure shows the SPSS post hoc tests dialog after the Post Hoc button is pressed in the GLM Univariate dialog. (There is a similar dialog when Analyze, Compare Means, One-Way ANOVA is chosen, invoking the SPSS ONEWAY procedure, which the GLM procedure has superseded.) The essential difference is that the planned multiple comparison tests in this section are based on the t-test, which generally has more power than the post-hoc tests listed in the next section.

Warning! The model, discussed above, will make a difference for multiple comparison tests. A factor (ex., race) may display different multiple comparison results depending on what other factors are in the model. Covariates cannot be in the model at all for these tests to be done. Interactions may be in the model, but multiple comparison tests are not available to test them.
Also note that all these t-tests are subject to the equality of variances assumption, and therefore the data should pass Levene's test, discussed below. Finally, note that the significance level (.05 is the default) may be set using the Options button off the main GLM dialog.
1. Simple t-test difference of means. The simple t-test is recommended when the researcher has a single planned comparison (a comparison of means specified beforehand on the basis of a priori theory). In SPSS, for One-Way ANOVA, select Analyze, Compare Means, One-Way ANOVA; click Post Hoc; select the multiple comparison test you want. If the Bonferroni test is requested, SPSS will print out a table of "Multiple Comparisons" giving the mean difference in the dependent variable between any two groups (ex., differences in test scores for any two educational groups). The significance of this difference is also printed, and an asterisk is printed next to differences significant at the .05 level or better. SPSS supports the Bonferroni test in its GLM and UNIANOVA procedures. A simple t-test, with or without Bonferroni adjustment, may also be obtained by selecting Statistics, Compare Means, One-Way ANOVA.

2. Bonferroni-adjusted t-test. Also called the Dunn test, Bonferroni-adjusted t-tests are used when there are planned multiple comparisons of means. As a general principle, when comparisons of group means are selected on a post-hoc basis simply because they are large, there is an expected increase in variability for which the researcher must compensate by applying a more conservative test; otherwise, the likelihood of Type I errors will be substantial. The Bonferroni adjustment is perhaps the most common approach to making post-hoc significance tests more conservative. The Bonferroni method applies the simple t-test, but then adjusts the observed significance level by multiplying it by the number of comparisons being made. For instance, a finding of .01 significance for 9 comparisons becomes .09. This is equivalent to saying that if the target alpha significance level is .05, then the t-test must show alpha/9 (ex., .05/9 = .0056) or lower for a finding of significance to be made.
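The arithmetic of the Bonferroni adjustment can be sketched in a few lines of Python. This is only an illustration: the function names and the nine p-values are hypothetical and do not come from the SPSS output discussed here. It shows that multiplying each p-value by the number of comparisons and comparing it with alpha gives the same decisions as comparing the raw p-values with alpha divided by the number of comparisons.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def bonferroni_alpha(alpha, m):
    """Per-comparison alpha needed to hold the overall level at `alpha`."""
    return alpha / m


# Hypothetical raw p-values for 9 planned comparisons (illustration only).
raw = [0.01, 0.004, 0.03, 0.20, 0.05, 0.60, 0.002, 0.15, 0.08]

adjusted = bonferroni_adjust(raw)
per_test_alpha = bonferroni_alpha(0.05, len(raw))

# Two equivalent ways of reaching the same decisions.
significant_by_adjusted_p = [p <= 0.05 for p in adjusted]
significant_by_adjusted_alpha = [p <= per_test_alpha for p in raw]

print(round(adjusted[0], 4))      # .01 for 9 comparisons becomes .09
print(round(per_test_alpha, 4))   # .05 / 9 = .0056
print(significant_by_adjusted_p == significant_by_adjusted_alpha)
```

In this made-up example only the comparisons with raw p-values of .004 and .002 remain significant after adjustment.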
Bonferroni-adjusted multiple t-tests are usually employed only when there are few comparisons, as with many it quickly becomes practically impossible to show significance. If the independents formed 8 groups there would be 8!/(6!2!) = 28 pairwise comparisons, and if one used the .05 significance level without adjustment, one would expect at least one of the comparisons to generate a false positive (thinking you had a relationship when you did not), since 28 x .05 = 1.4. Note this adjustment may be applied to F-tests as well as t-tests. That is, it can handle nonpairwise as well as pairwise comparisons.

The Bonferroni-adjusted t-test imposes an extremely small alpha significance level as the number of comparisons becomes large. That is, this method is not recommended when the number of comparisons is large because the power of the test becomes low. Klockars and Sax (1986: 38-39) recommend using a simple .05 alpha rate when there are few comparisons, but using the more stringent Bonferroni-adjusted multiple t-test when the number of planned comparisons is greater than the number of degrees of freedom for the between-groups mean square (which is k-1, where k is the number of groups). Nonetheless, researchers still try to limit the number of comparisons in order to reduce the probability of Type II errors (accepting a false null hypothesis). This test is not recommended when the researcher wishes to perform all possible pairwise comparisons.

By the Bonferroni test, the figure above shows whites are significantly different from blacks but not from "other" races, with respect to mean highest year of education completed (the dependent variable).

3. Sidak test. The Sidak test, also called the Dunn-Sidak test, is a variant on the Dunn or Bonferroni approach, using a t-test for pairwise multiple comparisons. The alpha significance level for multiple comparisons is adjusted to tighter (more accurate) bounds than for the Bonferroni test (Howell, 1997: 364). SPSS supports the Sidak test in its GLM and UNIANOVA procedures. In the figure above, the Sidak test shows the same pattern as the Bonferroni test.

4. Dunnett's test. Dunnett's test uses a t-statistic and is applied when the researcher wishes to compare each treatment group mean with the mean of the control group; for this purpose it has better power than alternative tests. Dunnett's test does not require a prior finding of significance in the overall F test "as it controls the familywise error rate independently" (Cardinal & Aitken, 2005: 89). This test, based on a 1955 article by Dunnett, is not to be confused with Dunnett's C or Dunnett's T3, discussed below. In the example illustrated above, Dunnett's test leaves out the last category ("other" race) as the reference category and shows whites are not significantly different from "other" but blacks are.

Hsu's multiple comparison with the best (MCB) test. Hsu's MCB is an adaptation of Dunnett's method for the situation where the researcher wishes to compare the mean of each level with the best level, as in a treatment experiment where the best treatment is known. In such analyses the purpose is often to identify alternative treatments which are not significantly different from the best treatment but which may cost less or have other desirable features. Hsu's MCB is supported by SAS JMP but not SPSS. Hsu's unconstrained multiple comparison with the best (UMCB) test is a variant which takes each treatment group in turn as a possible best treatment and compares all others to it.

Post-hoc multiple comparison tests, also just called "post-hoc tests," are used in exploratory research to assess which group means differ from which others, after the overall F test has demonstrated at least one difference exists. If the F test establishes that there is an effect on the dependent variable, the researcher then proceeds to determine just which group means differ significantly from others. That is, post-hoc tests are used when the researcher is exploring differences and is not limited to comparisons specified in advance on the basis of theory. These tests may also be used for confirmatory research, but the t-test-based tests in the previous section are generally preferred. In comparing group means on a post-hoc basis, one is comparing the means on the dependent variable for each of the k groups formed by the categories of the independent factor(s). The possible number of comparisons is k(k-1)/2. Multiple comparisons help specify the exact nature of the overall effect determined by the F test. However, note that post-hoc tests do not control for the levels of other factors or for covariates (that is, interaction and control effects are not taken into account).
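As a rough illustration of why the number of comparisons matters, the following Python sketch counts the k(k-1)/2 pairwise comparisons and shows how quickly unadjusted error accumulates. The function names are hypothetical. The familywise error formula assumes the comparisons are independent, which pairwise comparisons on the same groups are not, so the figure is only an approximation. The Sidak per-comparison alpha is given in its commonly stated form, 1 - (1 - alpha)^(1/m), which is an assumption here since the text above does not spell out the formula.

```python
def n_pairwise(k):
    """Number of possible pairwise comparisons among k group means."""
    return k * (k - 1) // 2


def expected_false_positives(alpha, m):
    """Expected number of Type I errors in m unadjusted tests when all nulls are true."""
    return alpha * m


def familywise_error(alpha, m):
    """Chance of at least one Type I error in m tests, ASSUMING independence."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-comparison alpha under the Bonferroni adjustment."""
    return alpha / m


def sidak_alpha(alpha, m):
    """Per-comparison alpha under the Sidak adjustment (commonly stated form)."""
    return 1 - (1 - alpha) ** (1 / m)


k = 8                      # eight groups, as in the example earlier in the text
m = n_pairwise(k)          # 28 comparisons
print(m)
print(expected_false_positives(0.05, m))
print(round(familywise_error(0.05, m), 3))
print(bonferroni_alpha(0.05, m) < sidak_alpha(0.05, m))
```

The last line illustrates the sense in which the Sidak bound is tighter: its per-comparison alpha is slightly larger than the Bonferroni one, so it is slightly less conservative.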
Findings of significance or nonsignificance between factor levels must be understood in the context of full ANOVA F-test findings, not just post-hoc tests, which are subordinate to the overall F test. Note the model cannot contain covariates when employing these tests.

Computation. The q-statistic, also called the q range statistic or the Studentized range statistic, is commonly used in coefficients for post-hoc multiple comparisons, though some post-hoc tests use the t statistic. In contrast to the planned comparison t-test, coefficients based on the q-statistic are commonly used for post-hoc comparisons, that is, when the researcher wishes to explore the data to uncover large differences without limiting investigation by a priori theory. Both the q and t statistics use the difference of means in the numerator, but where the t statistic uses the standard error of the difference between the means in the denominator, q uses the standard error of the mean. Consequently, where the t test tests the difference between two means, the q-statistic tests the probability that the largest mean and smallest mean among the k groups formed by the categories of the independent(s) were sampled from the same population. If the q-statistic computed for the two sample means is not as large as the criterion q value in a table of critical q values, then the researcher cannot reject the null hypothesis that the groups do not differ at the given alpha significance level (usually .05). If the null hypothesis is not rejected for the largest compared to the smallest group mean, it follows that the intermediate groups are also treated as drawn from the same population, so the q-statistic is also a test of homogeneity for all k groups formed by the independent variable(s).

Output formats: pairwise vs. multiple range. In pairwise comparison tests, output similar to that for the Bonferroni and Sidak tests above is produced for the LSD, Games-Howell, Tamhane's T2, Dunnett's C, and Dunnett's T3 tests. Homogeneous subsets for range tests are provided for S-N-K, Tukey's b, Duncan, R-E-G-W F, R-E-G-W Q, and Waller-Duncan. Some tests are of both types: Tukey's honestly significant difference test, Hochberg's GT2, Gabriel's test, and Scheffé's test.

Warning! The model, discussed above, will make a difference for post-hoc tests. A factor (ex., race) may display different multiple comparison results depending on what other factors are in the model. Covariates cannot be in the model at all for these tests to be done. Interactions may be in the model, but multiple comparison tests are not available to test them. Also note that all the post-hoc tests are subject to the equality of variances assumption, and therefore the data should pass Levene's test, discussed below, with the exception of Tamhane's T2, Dunnett's T3, Games-Howell, and Dunnett's C, all of which are tailored for data where equal variances cannot be assumed. Finally, note that the significance level (.05 is the default) may be set using the Options button off the main GLM dialog.

Tests assuming equal variances

1. Least significant difference (LSD) test. This test, also called Fisher's LSD, the protected LSD, or the protected t test, is based on the t-statistic and thus can be considered a form of t-test. "Protected" means the LSD test should be applied only after the overall F test is shown to be significant. LSD compares all possible pairs of means after the F-test rejects the null hypothesis that groups do not differ (this is a requirement of the test). (Note some computer packages wrongly report LSD t-test coefficients for comparisons even if the F test leads to acceptance of the null hypothesis.) It can handle both pairwise and nonpairwise comparisons and does not require equal sample sizes. LSD is the most liberal of the post-hoc tests (it is most likely to reject the null hypothesis in favor of finding groups do differ).
It controls the experimentwise Type I error rate at a selected alpha level (typically 5%), but only for the omnibus (overall) test of the null hypothesis. LSD allows higher Type I error rates for the partial null hypotheses involved in the comparisons. Toothaker (1993: 42) recommends against any use of LSD on the grounds that it has poor control of experimentwise alpha significance, and better alternatives exist such as Shaffer-Ryan, discussed below. Others, such as Cardinal & Aitken (2005: 86), recommend its use only for factors with three levels. However, the LSD test is the default in SPSS for pairwise comparisons in its GLM or UNIANOVA procedures. As illustrated below, the LSD test is interpreted in the same manner as the Bonferroni test above and for this example yields the same substantive results: whites differ significantly from blacks but not from other races on mean highest school year completed.
The Fisher-Hayter test is a modification of the LSD test meant to control for the liberal alpha significance level allowed by LSD. It is used when all pairwise comparisons are done post-hoc, but power may be low for fewer comparisons. See Toothaker (1993: 43-44). SPSS does not support the Fisher-Hayter test.

2. Tukey's test, a.k.a. Tukey honestly significant difference (HSD) test. As illustrated below, the multiple comparisons table for the Tukey test displays all pairwise comparisons between groups, interpreted in the same way as for the Bonferroni test discussed above. The Tukey test is conservative when group sizes are unequal. It is often preferred when the number of groups is large precisely because it is a conservative pairwise comparison test, and researchers often prefer to be conservative when the large number of groups threatens to inflate Type I errors. HSD is one of the most conservative of the post-hoc tests in that it is among the most likely to accept the null hypothesis of no group differences. Some recommend it only when all pairwise comparisons are being tested. When all pairwise comparisons are being tested, the Tukey HSD test is more powerful than the Dunn test (Dunn may be more powerful for fewer than all comparisons). The Tukey HSD test is based on the q-statistic (the Studentized range distribution) and is limited to pairwise comparisons. Select "Tukey" on the SPSS Post Hoc dialog.
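The difference between the t and q statistics described in the Computation paragraph can be made concrete with a small Python sketch. The three groups below are hypothetical data with equal sample sizes, and the helper names are illustrative. Both statistics are computed here from the pooled within-group mean square of the one-way ANOVA, which is an assumption about the error term. No critical values are computed: the criterion q still has to be looked up in a table of the Studentized range for the given number of groups and error degrees of freedom.

```python
from math import sqrt
from statistics import mean


def mse_within(groups):
    """Pooled within-group (error) mean square of a one-way ANOVA."""
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_within = sum(len(g) for g in groups) - len(groups)
    return ss_within / df_within


def t_statistic(g1, g2, mse):
    """Difference of means over the standard error of the difference."""
    return (mean(g1) - mean(g2)) / sqrt(mse * (1 / len(g1) + 1 / len(g2)))


def q_statistic(g1, g2, mse):
    """Difference of means over the standard error of a mean (equal n assumed)."""
    n = len(g1)
    return (mean(g1) - mean(g2)) / sqrt(mse / n)


# Hypothetical scores for three groups of four cases each.
a = [12, 14, 11, 15]
b = [16, 18, 17, 17]
c = [13, 12, 14, 13]

mse = mse_within([a, b, c])

# Largest mean (b) against smallest mean (a).
t = t_statistic(b, a, mse)
q = q_statistic(b, a, mse)

print(round(t, 3), round(q, 3))
print(abs(q - t * sqrt(2)) < 1e-9)  # with equal n, q is t times the square root of 2
```

With equal group sizes the two statistics differ only by the constant factor sqrt(2); what separates a q-based test from a t-test is the reference distribution, which for q depends on the number of groups being spanned.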
3. Tukey-b test, a.k.a. Tukey's wholly significant difference (WSD) test, also shown above, is a less conservative version of Tukey's HSD test, also based on the q-statistic. The critical value of WSD (Tukey-b) is the mean of the corresponding values for Tukey's HSD test and the Newman-Keuls test, discussed below. In the illustration above, note that no "Sig" significance value is output in the range test table for Tukey-b. Rather, the table shows there are two significantly different homogeneous subsets on highest year of school completed, with the first group being blacks and the second group being whites and other race.

4. S-N-K or Student-Newman-Keuls test, also called the Newman-Keuls test, is a little-used post-hoc comparison test of the range type, also based on the q-statistic, which is used to evaluate partial null hypotheses (hypotheses that all but g of the k means come from the same population). It is recommended for one-way balanced ANOVA designs when there are only three means to be compared (Cardinal & Aitken, 2005: 87). Let k = the number of groups formed by categories of the independent variable(s). First all combinations of k-1 means are tested, then k-2 groups, and so on until sets of 2 means are tested. As one proceeds toward testing ever smaller sets, testing stops if a nonsignificant range is discovered (that is, if the q-statistic for the comparison of the highest and lowest mean in the set [the "stretch"] is not as great as the critical value of q for the number of groups in the set). Klockars and Sax (1986: 57) recommend the Student-Newman-Keuls test when the researcher wants to compare adjacent means (pairs adjacent to each other when all means are presented in rank order). Toothaker (1993: 29) recommends Newman-Keuls only when the number of groups to be compared equals 3, assuming one wants to control the comparison error rate at the experimentwise alpha rate (ex., .05), but states that the Ryan, Shaffer-Ryan, or Fisher-Hayter tests are preferable (Toothaker, 1993: 46). The example below shows the same homogeneous groups as in the Tukey-b test above.

Duncan test. A range test somewhat similar to the S-N-K test and also not commonly used due to poor control (Cardinal & Aitken, 2005: 88). Illustrated further below.

5. Ryan test (REGWQ): This is the Ryan-Einot-Gabriel-Welsch multiple range test based on range and is the usual Ryan test, a modified Student-Newman-Keuls test adjusted so critical values decrease as stretch size (the range from highest to lowest mean in the set being considered) decreases. The Ryan test is more powerful than the S-N-K test or the Duncan multiple range test described above. It is considered a conservative test and is recommended for one-way balanced ANOVA designs and is not recommended for unbalanced designs. The result is that Ryan controls the experimentwise alpha rate at the desired level (ex., .05) even when the number of groups exceeds 3, but at a cost of being less powerful (more chance of Type II errors) than Newman-Keuls. As with Newman-Keuls, Ryan is a step-down procedure such that one will not get to smaller stretch comparisons if the null hypothesis is accepted for larger stretches of which they are a subset. Toothaker (1993: 56) calls Ryan the "best choice" among tests supported by major statistical packages because it maintains good alpha control (ex., better than Newman-Keuls) while having at least 75% of the power of the most powerful tests (ex., better than Tukey HSD). Cardinal and Aitken (2005: 87) consider the Ryan test a "good compromise" between the liberal Student-Newman-Keuls test and the conservative Tukey HSD test. For the same data, it comes to the same conclusion, as illustrated below.

6. Ryan test (REGWF): This is the Ryan test based on the F statistic rather than range.
It is a bit more powerful than REGWQ, though less common and more computationally intensive. Also a conservative test, it tends to come to the same substantive conclusions as the ordinary Ryan test. REGWF is supported by SPSS but not SAS.

The Shaffer-Ryan test modifies the Ryan test. It is also a protected or step-down test, requiring that the overall F test reject the null hypothesis first, but it uses slightly different critical values. To date, Shaffer-Ryan is not supported by SAS or SPSS, but it is recommended by Toothaker (1993: 55) as "one of the best multiple comparison tests in terms of power."

7. The Scheffé test is a widely used range test which works by first requiring that the overall F test of the null hypothesis be rejected. If the null hypothesis is not rejected overall, then it is not rejected for any comparison null hypothesis. If the overall null hypothesis is rejected, however, then F values are computed simultaneously for all possible comparison pairs and must be higher than an even larger critical value of F than for the overall F test described above. Let F be the critical value of F as used for the overall test. For the Scheffé test, the new, higher critical value, F', is (k-1)F. The Scheffé test can be used to analyze any linear combination of group means. Output, illustrated below, is similar to other range tests discussed above and for this example comes to the same conclusions.
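The Scheffé criterion F' = (k-1)F can be illustrated with a short Python sketch. Everything numeric here is hypothetical: the group means, sample sizes, and error mean square are made up, and the overall critical value of 4.26 is an approximate table value for F at the .05 level with 2 and 9 degrees of freedom. The contrast F is computed in the usual way for a linear combination of means, as the squared estimate divided by the error mean square times the sum of squared coefficients over sample sizes, which is an assumption insofar as the text above does not give that formula.

```python
def scheffe_critical(f_crit, k):
    """Scheffe critical value F' = (k - 1) * F, with F the overall-test critical value."""
    return (k - 1) * f_crit


def contrast_f(means, ns, coeffs, mse):
    """F statistic for the linear combination sum(c_i * mean_i) of group means."""
    estimate = sum(c * m for c, m in zip(coeffs, means))
    scale = sum(c * c / n for c, n in zip(coeffs, ns))
    return estimate ** 2 / (mse * scale)


# Hypothetical summary statistics for k = 3 groups of 4 cases each.
means = [13.0, 17.0, 13.0]
ns = [4, 4, 4]
mse = 14 / 9                 # error mean square, 9 degrees of freedom
f_overall_crit = 4.26        # approximate table value of F(.05; 2, 9)

# Complex comparison: group 2 against the average of groups 1 and 3.
coeffs = [-0.5, 1.0, -0.5]

f_prime = scheffe_critical(f_overall_crit, k=len(means))
f_observed = contrast_f(means, ns, coeffs, mse)

print(round(f_prime, 2))
print(round(f_observed, 2))
print(f_observed > f_prime)  # the contrast is significant only if it exceeds F'
```

Because the same F' applies to every linear combination of the k means, the criterion does not tighten further as more contrasts are examined, which is why the test suits large numbers of complex comparisons.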
While the Scheffé test has the advantage of maintaining an experimentwise .05 significance level in the face of multiple comparisons, it does so at the cost of a loss in statistical power (more Type II errors may be made, that is, thinking you do not have a relationship when you do). That is, the Scheffé test is a very conservative one (more conservative than Dunn or Tukey, for example), not appropriate for planned comparisons but rather restricted to post-hoc comparisons. Even for post-hoc comparisons, the test is used for complex comparisons and is not recommended for pairwise comparisons due to "an unacceptably high level of Type II errors" (Brown and Melamed, 1990: 35). Toothaker (1993: 28) recommends the Scheffé test only for complex comparisons, or when the number of comparisons is large. The Scheffé test is low in power and thus not preferred for particular comparisons, but it can be used when one wishes to do all or a large number of comparisons. Tukey's HSD is preferred for making all pairwise comparisons among group means, and Scheffé for making all or a large number of other linear combinations of group means.

8. Hochberg GT2 test. A range test considered similar to Tukey's HSD but which is quite robust against violation of homogeneity of variances except when cell sizes are extremely unbalanced. It is generally less powerful than Tukey's HSD when factor cell sizes are not equal.
9. Gabriel test. A range test based on the Studentized maximum modulus test. The Gabriel test is similar to but more powerful than the Hochberg GT2 test when cell sizes are unequal, but it tends to display a liberal bias when cell sizes vary greatly.

10. Waller-Duncan test. A range test based on a Bayesian approach, making it different from other tests in this section. When factor cell sizes are not equal, it uses the harmonic mean of the sample sizes. The kratio is specified by the researcher in advance in lieu of specifying an alpha significance level (ex., .05). The kratio is known as the Type 1/Type 2 error seriousness ratio. The default value is 100, which loosely corresponds to a .05 alpha level; kratio = 500 loosely corresponds to alpha = .01.

Tests not assuming equal variances. If the model is a one-way ANOVA with only one factor and no covariates and no interactions, then four additional tests are available which do not require the usual ANOVA assumption of homogeneity of variances.

1. Tamhane's T2 test. Tamhane's T2 is a conservative test. It is considered more appropriate than Tukey's HSD when cell sizes are unequal and/or when homogeneity of variances is violated.
2. Games-Howell test. The Games-Howell test is a modified HSD test which is appropriate when the homogeneity of variances assumption is violated. It is designed for unequal variances and unequal sample sizes, and is based on the q-statistic distribution. Games-Howell is slightly less conservative than Tamhane's T2 and can be liberal when sample size is small; it is recommended only when group sample sizes are greater than 5. Because Games-Howell is only slightly liberal and because it is more powerful than Dunnett's C or T3, it is recommended over these tests. Toothaker (1993: 66) recommends Games-Howell for the situation of unequal (or equal) sample sizes and unequal or unknown variances.

3. Dunnett's T3 test and Dunnett's C test. These tests might be used in lieu of Games-Howell when it is essential to maintain strict control over the alpha significance level across multiple tests, similar to the purpose of Bonferroni adjustments (ex., exactly .05 or better).

4. The Tukey-Kramer test: This test, described in Toothaker (1993: 60), who also gives an appendix with critical values, controls experimentwise alpha. It requires equal population variances. Toothaker (p. 66) recommends this test for the situation of equal variances but unequal sample sizes. In SPSS, if you ask for the Tukey test and sample sizes are unequal, you will get the Tukey-Kramer test, using the harmonic mean. Not supported as a separate option by SPSS.

5. The Miller-Winer test: Not recommended unless equal population variances are assured. Not supported by SPSS.
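The role of the harmonic mean in the Tukey-Kramer adjustment for unequal sample sizes can be sketched as follows. The standard error used is the commonly stated Tukey-Kramer form, sqrt((MSE/2)(1/n_i + 1/n_j)); that formula, the function names, and the numbers are assumptions for illustration, and the text above does not say exactly how SPSS applies the harmonic mean. The sketch shows that this standard error is the same as sqrt(MSE/n), with n replaced by the harmonic mean of the two group sizes.

```python
from math import sqrt
from statistics import harmonic_mean


def tukey_kramer_q(mean_i, mean_j, n_i, n_j, mse):
    """q statistic with the Tukey-Kramer standard error for unequal n."""
    se = sqrt((mse / 2) * (1 / n_i + 1 / n_j))
    return abs(mean_i - mean_j) / se


def q_with_harmonic_n(mean_i, mean_j, n_i, n_j, mse):
    """Same statistic, written with the harmonic mean of the two sample sizes."""
    n_h = harmonic_mean([n_i, n_j])
    return abs(mean_i - mean_j) / sqrt(mse / n_h)


# Hypothetical group means, unequal sample sizes, and error mean square.
mean_1, n_1 = 17.0, 4
mean_2, n_2 = 13.0, 6
mse = 1.5

q_tk = tukey_kramer_q(mean_1, mean_2, n_1, n_2, mse)
q_hm = q_with_harmonic_n(mean_1, mean_2, n_1, n_2, mse)

print(harmonic_mean([n_1, n_2]))   # 4.8 for sample sizes 4 and 6
print(round(q_tk, 3), round(q_hm, 3))
print(abs(q_tk - q_hm) < 1e-9)
```

As with the other q-based tests, the resulting statistic is compared with a tabled critical value of the Studentized range; the sketch covers only the standard error.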