Keppel, G. & Wickens, T. D. Design and Analysis Chapter 6: Simultaneous Comparisons and the Control of Type I Errors

You should design your research with specific questions in mind, which you then test with specific analyses. However, your design will often lend itself to additional analyses, which may well allow you to learn more about the operation of the variables in question. These additional analyses come with the burden that you'll have a greater chance of making a Type I error among the additional analyses. This chapter covers means of controlling Type I error among these simultaneous comparisons.

6.1 Research Questions and Type I Error

The family of tests is a set of tests you intend to compute to address a set of research questions. The familywise Type I error rate (α_FW) is the probability of making at least one Type I error in the family of tests when all H0 are true. When you consider the huge set of possible post hoc tests one might compute, then you are considering the experimentwise error rate (α_EW). Needless to say, it will typically be the case that α_FW < α_EW.

With a per-comparison error rate (α), you can compute the familywise Type I error rate for a number of comparisons (c) as:

$$\alpha_{FW} = 1 - (1 - \alpha)^{c} \qquad (6.1)$$

Thus, using K&W51 as an example, if you intended to compute three comparisons using α = .05 for each comparison, your α_FW would be .14. Though not as accurate as the above formula, for a quick and dirty estimate of α_FW you could simply use cα (which in this case would give you an estimate of .15). Of course, as the number of comparisons grows, so does α_FW. To convince yourself of this relationship between c and α_FW, compute α_FW for the number of comparisons indicated below:

Number of Comparisons (c)    α_FW
5                            ____
6                            ____

The formula for computing α_FW only works for orthogonal comparisons (i.e., assumed independence), but α_FW also increases with an increasing number of nonorthogonal comparisons.
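
If you'd rather not do the arithmetic by hand, Equation 6.1 and the quick cα estimate are easy to tabulate. What follows is only a minimal sketch: the function names are my own (hypothetical), and the exact formula assumes independent comparisons.

```python
def familywise_exact(alpha: float, c: int) -> float:
    """Equation 6.1: chance of at least one Type I error in c independent tests."""
    return 1 - (1 - alpha) ** c


def familywise_quick(alpha: float, c: int) -> float:
    """Quick-and-dirty estimate c * alpha (never smaller than the exact value)."""
    return c * alpha


alpha = 0.05
for c in range(1, 7):
    exact = familywise_exact(alpha, c)
    quick = familywise_quick(alpha, c)
    print(f"c = {c}: exact = {exact:.3f}, c * alpha = {quick:.2f}")
```

For c = 3 the two functions give the .14 and .15 quoted above, and the gap between them widens as c grows.
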
Thus, because for K&W51 there are only 3 orthogonal comparisons, the estimates of α_FW above are not accurate, though they still make the point that α_FW will increase with increasing numbers of comparisons. Decreasing your per-comparison error rate (α) will also serve to decrease your α_FW.

K&W distinguish the types of questions that experimenters might ask of their data. If the questions are the relatively small number (e.g., rarely more than a - 1) of primary questions, K&W suggest that no adjustment of the per-comparison error rate (α) is necessary. I'm not sure that journal editors would agree with this suggestion.

It's much more typical that one would want to conduct a set of comparisons computed to understand the omnibus ANOVA. Sometimes the number of these comparisons is quite limited. Sometimes you want to compute all possible simple pairwise comparisons. And sometimes you may be interested in exploring a fairly large set of simple and complex comparisons. The approach for controlling α_FW varies for the different situations. Especially when the number of tests might be fairly large, it makes sense to adopt an α_FW that is greater than .05 (e.g., .10). Keep in mind, however, that the guidelines for choosing α_FW, or choosing a strategy for controlling Type I error in such simultaneous comparisons, are not rigid and universally agreed on.

6.2 Planned Comparisons

OK, we'll discuss planned comparisons, but keep in mind that journal editors might not trust that you've actually planned the comparisons in advance. My advice would be to treat all comparisons as post hoc, at least until you've achieved the sort of stature in the discipline that buys you some slack from editors. :-)

Planned comparisons must be specified in the initial design of an experiment. They are essential and pivotal tests, not a plan to conduct a fishing expedition. For clearly planned tests, familywise error correction is generally deemed unnecessary. The comparisons that you choose to compute should be driven by theoretical concerns, rather than concerns about orthogonality. However, orthogonal comparisons should be used because they keep hypotheses logically and conceptually separate, making them easier to interpret. Nonorthogonal comparisons must be interpreted with care because of the difficulty of making inferences (interpreting the outcomes). For instance, in an earlier edition, Keppel wondered, "If we reject the null hypothesis for two nonorthogonal comparisons, which comparison represents the true reason for the observed differences?"
Don't allow yourself to be tempted into computing a large number of planned comparisons. For an experiment with a levels, there are $1 + \frac{3^{a} - 1}{2} - 2^{a}$ comparisons (simple pairwise and complex) possible. Please, don't ever compute all possible comparisons! If you think carefully about your research, a much smaller set of planned comparisons would be reasonable.

A common suggestion for multiple planned comparisons is to conduct up to a - 1 comparisons with each comparison conducted at α = .05. The implication of this suggestion is that people are willing to tolerate a familywise error rate of (a - 1)(α). Thus, in an experiment with 5 levels, you could comfortably compute 4 planned comparisons with each comparison tested using α = .05, for a familywise error rate of ~.19. As the number of planned comparisons becomes larger than a - 1, consider using a correction for α_FW (e.g., Sidák-Bonferroni procedure).
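
To get a feel for how quickly the set of possible comparisons grows, and for what the "a - 1 comparisons at α = .05" rule of thumb costs you, here is a small sketch. The function names are my own (hypothetical), and the familywise rate is computed with Equation 6.1, so it assumes independent comparisons.

```python
def possible_comparisons(a: int) -> int:
    """Number of simple pairwise plus complex comparisons among a means:
    1 + (3**a - 1) / 2 - 2**a. (3**a - 1 is always even, so // is exact.)"""
    return 1 + (3 ** a - 1) // 2 - 2 ** a


def tolerated_familywise(a: int, alpha: float = 0.05) -> float:
    """Familywise rate (Equation 6.1) for a - 1 comparisons, each tested at alpha."""
    return 1 - (1 - alpha) ** (a - 1)


for a in range(3, 7):
    print(f"a = {a}: {possible_comparisons(a)} possible comparisons, "
          f"alpha_FW with a - 1 planned comparisons = {tolerated_familywise(a):.3f}")
```

With a = 5 the second function returns roughly .19, matching the figure above, while the first shows why computing every possible comparison is a bad idea.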

6.3 Restricted Sets of Contrasts

The Bonferroni Procedure

The most widely applicable familywise control procedure for small families is the Bonferroni correction. The Bonferroni inequality (α_FW < cα) states, "The familywise error rate is always less than the sum of the per-comparison error rates of the individual tests." Thus, to ensure that our α_FW is kept to a certain level, we could choose that level (e.g., .05 or .10 depending on our preference) and then divide that value by the number of comparisons we wish to compute. Assuming that you are comfortable with α_FW = .10 and you are about to compute 5 comparisons, you would treat comparisons as significant if they occur with p ≤ α (the per-comparison rate), which would be .02 here. Given that SPSS prints out a t and its associated p-value when you ask it to compute contrasts, you'd be able to assess the significance of the t statistic by comparing its p-value to your Bonferroni per-comparison rate.

For hand computation of such tests, you occasionally need to compute a critical value of t or F for a per-comparison error rate that is not found in the table of critical values of F (A.1). For example, using K&W51, suppose that you wanted to compute 6 simple pairwise comparisons (e.g., 4 vs. 12, 4 vs. 20, etc.). If you were comfortable with α_FW = .10, your per-comparison error rate (α) would be .0167. (If you preferred α_FW = .05, your per-comparison error rate would be .008.) Let's presume that we're using α = .0167. Given homogeneity of variance (a topic that arises in Ch. 7) for K&W51, we would use the overall error term (MS_S/A = 150.458) for any comparison. Thus, df_Error = 12 and df_Comparison is always 1. In Table A.1, for those df we see the following tabled α values:

α       F_Crit
.100     3.18
.050     4.75
.025     6.55
.010     9.33
.001    18.64

Although our α is not tabled, we can see that F_Crit for our α would be less than 9.33 and greater than 6.55. (And, of course, we could determine t_Crit values by taking the square root of the tabled F_Crit values.) Suppose that you compute F_Comparison = 10.0 (or any F ≥ 9.33).
You would conclude that the two groups came from populations with different means (reject H0). Suppose, instead, that you compute F_Comparison = 6.0 (or any F ≤ 6.55). You would conclude that you had insufficient evidence to claim that the two groups came from populations with different means (retain H0). The tricky stage arises if your F_Comparison = 9.0. To assess the significance of this outcome, you need to actually compute F_Crit for α = .0167. You can always use the formula below to determine the F for a given level of α.

$$t = z + \frac{z^{3} + z}{4\,(df_{Error} - 2)}$$

Don't even ask what that complex formula means! However, what it does is clear. Ultimately, it will generate F_Crit for any α. You need to keep two points in mind. First, the z in the formula needs to be two-tailed, so you need to look up .0167/2 = .008 in the tail of the unit normal distribution. Second, you're generating a t value, so you need to square it to get an F_Crit. In this example, z = 2.39. So t = 2.79 and F = 7.79. Thus, using the Bonferroni procedure, if you were interested in computing 6 simple pairwise comparisons on the K&W51 data set (and using α_FW = .10), to be significant each F_Comparison needs to be ≥ 7.79. Alternatively, you can avoid the formula entirely and use a web-based calculator, such as: http://www.graphpad.com/quickcalcs/statratio1.cfm :-)

The Sidák-Bonferroni Procedure

This procedure is a modified Bonferroni procedure that results in a bit more power, so it is preferred to the straight Bonferroni procedure. It makes use of the following equation:

$$\alpha = 1 - (1 - \alpha_{FW})^{1/c} \qquad (6.5)$$

Because of the preference for this procedure, K&W provide useful tables in A.4. The tables illustrate information for α_FW = .20, α_FW = .10, α_FW = .05, and α_FW = .01 on pages 578-581. Keeping with the above example, using K&W51 and 6 comparisons with α_FW = .10, we would look on p. 579. The probability associated with 6 comparisons would be .01741 (which you could also obtain by substituting .10 for α_FW and 6 for c in the above formula, but the table is easier). Note, of course, that if F_Comparison yielded p = .01741 it would be significant with the Sidák-Bonferroni procedure, but not with the Bonferroni procedure (where it would have to be ≤ .0167). Not a huge difference, but the Sidák-Bonferroni procedure provides a bit more power.

You can also compare means by computing a critical mean difference according to the formula below, as applied to the above example:

$$D_{S\text{-}B} = t_{S\text{-}B}\sqrt{\frac{2\,MS_{Error}}{n}} = 2.76\sqrt{\frac{2(150.458)}{4}} = 23.94$$

Thus, one possible comparison might be 4hr vs. 12hr. That difference would not be significant (37.75 - 26.5 = 11.25). However, in comparing 4hr vs. 20hr we would find a significant difference (57.5 - 26.5 = 31).

Dunnett's Test

If you have a control group to which you wish to compare the other treatments in your study, then the Dunnett test is appropriate.
Once again, the most general approach is to compute F_Comp and then compare that F ratio to an F_Crit value. For the Dunnett test, the F_Crit is

$$F_{D} = (t_{D})^{2}$$

You look up the value of t_D in Table A.5 (pp. 582-585). To do so, you would again need to decide on the level of α_FW you'd like to use and then you'd need to know how many conditions are involved in your experiment (control plus experimental groups). For the K&W51 example, let's assume that the 4hr group was a control group (no sleep deprivation) to which you'd like to compare each of the other groups. Thus, there would be a total of four groups. With α_FW = .10 and 4 groups, t_D = 2.29. Thus, you'd compare each F_Comparison against F_D = 5.24. You could also take a critical mean difference approach with the Dunnett test:

$$D_{Dunnett} = t_{Dunnett}\sqrt{\frac{2\,MS_{Error}}{n}} = 2.29\sqrt{\frac{2(150.458)}{4}} = 19.86 \qquad (6.6)$$

Note that the critical mean difference here is less than that found with the Sidák-Bonferroni procedure, indicating that the Dunnett test is more powerful. Nonetheless, I would bet that you rarely find yourself in a situation where you'll want to compute the Dunnett test.

6.4 Pairwise Comparisons

Tukey's HSD Procedure

If you are interested in comparing every pair of means (simple pairwise comparisons), you might use the Tukey HSD (Honestly Significant Difference) Procedure. Using this procedure requires you to use the Studentized Range Statistic (q) found in Appendix A.6 (pp. 586-589). Again, you can first compute F_Comparison, after which you would compare that value to a critical value obtained from the tables, which you then square. For the example we've been using (K&W51, α_FW = .10, 6 pairwise comparisons):

$$F_{HSD} = \frac{q^{2}}{2} = \frac{3.62^{2}}{2} = 6.55$$

Alternatively, you could compute a critical mean difference. For the example we've been using, you'd find:

$$D_{HSD} = q_{a}\sqrt{\frac{MS_{Error}}{n}} = 3.62\sqrt{\frac{150.458}{4}} = 22.2 \qquad (6.7)$$

Note that this procedure is more liberal than the Sidák-Bonferroni procedure for what are essentially the same 6 comparisons (D_S-B = 23.94). K&W suggest comparing your differences among means in a matrix. For K&W51, it would look like this:

              a1 (26.5)   a2 (37.75)   a3 (57.5)   a4 (61.75)
a1 (26.5)     -----
a2 (37.75)    11.25       -----
a3 (57.5)     31.0        19.75        -----
a4 (61.75)    35.25       24.0         4.25        -----

This table allows you to see that there are three significant comparisons (the differences larger than 22.2). You cannot use this critical mean difference approach (Formula 6.7) when you have unequal sample sizes (though the formula can be modified as below), nor should you take this approach when there is heterogeneity of variance.
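The hand computations in 6.3 and 6.4 so far can be checked with a few lines of code. Treat this as a rough sketch rather than a definitive implementation: the function names are my own (hypothetical), n = 4 per group is inferred from df_Error = 12 with four groups, and the critical t and q values are simply the tabled values quoted in the text. The approximate critical F uses the unrounded z, so it comes out slightly above the 7.79 obtained by hand with z rounded to 2.39.

```python
from math import sqrt
from statistics import NormalDist

# Values for the K&W51 example, taken from the text.
MS_ERROR = 150.458
DF_ERROR = 12
N_PER_GROUP = 4  # assumption: inferred from df_Error = 12 with a = 4 groups
MEANS = {"a1": 26.5, "a2": 37.75, "a3": 57.5, "a4": 61.75}


def bonferroni_alpha(alpha_fw, c):
    """Per-comparison rate under the Bonferroni procedure."""
    return alpha_fw / c


def sidak_alpha(alpha_fw, c):
    """Per-comparison rate under the Sidak-Bonferroni procedure (Equation 6.5)."""
    return 1 - (1 - alpha_fw) ** (1 / c)


def approx_critical_f(alpha, df_error):
    """Approximate F_Crit: two-tailed z, converted to t, then squared."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    t = z + (z ** 3 + z) / (4 * (df_error - 2))
    return t ** 2


def d_from_t(t_crit):
    """Critical mean difference for t-based procedures (Sidak-Bonferroni, Dunnett)."""
    return t_crit * sqrt(2 * MS_ERROR / N_PER_GROUP)


def d_from_q(q_crit):
    """Critical mean difference for q-based procedures (Formula 6.7)."""
    return q_crit * sqrt(MS_ERROR / N_PER_GROUP)


def significant_pairs(d_crit):
    """Pairs of groups whose mean difference exceeds the critical difference."""
    names = list(MEANS)
    return [(lo, hi) for i, lo in enumerate(names) for hi in names[i + 1:]
            if abs(MEANS[hi] - MEANS[lo]) > d_crit]


print("Bonferroni alpha:", round(bonferroni_alpha(0.10, 6), 4))
print("Sidak-Bonferroni alpha:", round(sidak_alpha(0.10, 6), 5))
print("Approximate Bonferroni F_Crit:", round(approx_critical_f(0.10 / 6, DF_ERROR), 2))
print("D_S-B:", round(d_from_t(2.76), 2))      # tabled t_S-B from the text
print("D_Dunnett:", round(d_from_t(2.29), 2))  # tabled t_D from the text
print("D_HSD:", round(d_from_q(3.62), 2))      # tabled q from the text
print("Significant by HSD:", significant_pairs(d_from_q(3.62)))
```

The last line lists the same three pairs that stand out in the matrix of differences.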

When sample sizes are different, replace n in the formula with ñ, computed as:

$$\tilde{n} = \frac{2}{\dfrac{1}{n_{1}} + \dfrac{1}{n_{2}}}$$

This value is actually a special kind of mean (harmonic). Thus, if one group had n = 10 and the other group had n = 20, then ñ for the formula would be 13.33. If you have reason to suspect heterogeneity of variance (as discussed in Chapter 7), the formula would become:

$$D_{HSD} = q_{a}\sqrt{\frac{1}{2}\left(\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}\right)} \qquad (6.8)$$

The df you'd use to look up q emerge from a complex formula (7.13), so we'll return to this issue once we've discussed the implications of heterogeneity of variance.

The Fisher-Hayter Procedure

Tukey's procedure is the simplest way to test the pairwise differences and is the one that is most applicable to any pattern of effects. Hmmm, so why are K&W talking about alternatives? In general, some people were concerned that HSD is too conservative, so they wanted to derive more powerful simple pairwise comparison procedures.

The Fisher-Hayter procedure requires that you first compute an overall ANOVA and reject H0. If you are able to reject H0 in the overall ANOVA, then use the Studentized Range Statistic (q) found in Appendix A.6 (pp. 586-589). For HSD, you'd look up q for a treatment means. However, for Fisher-Hayter, you'd look up q for a - 1 treatment means. Otherwise, the formulas are the same, as seen below:

$$D_{FH} = q_{a-1}\sqrt{\frac{MS_{Error}}{n}} = 3.20\sqrt{\frac{150.458}{4}} = 19.63 \qquad (6.9)$$

I've filled in the values that you'd use for K&W51. The critical mean difference here is smaller than that found for HSD, so this test is more powerful. "The Fisher-Hayter procedure provides excellent control of Type I error, a fact that has been demonstrated in several simulation studies...we suggest that you use this procedure, particularly when making calculations by hand."

The Newman-Keuls and Related Procedures

This procedure is sometimes referred to as Student-Newman-Keuls (SNK). K&W describe the procedure for computation of SNK. However, my advice (and theirs) would be to use Fisher-Hayter or Tukey's HSD.
You'll see SNK used, but often by people who learned to use it long ago and continue to do so, even after better approaches have been identified.
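
The pieces of this section that differ from plain HSD (the harmonic ñ, the separate-variance version in Formula 6.8, and the smaller q used by Fisher-Hayter) can be sketched as below. The function names are my own (hypothetical), n = 4 per group is inferred from df_Error = 12 with four groups, and the q values are the tabled values quoted in the text.

```python
from math import sqrt


def harmonic_n(n1, n2):
    """Harmonic mean of two sample sizes, used in place of n for unequal groups."""
    return 2 / (1 / n1 + 1 / n2)


def critical_difference(q_crit, ms_error, n):
    """Formulas 6.7 and 6.9: q * sqrt(MS_Error / n). Pass the q for a means
    (HSD) or for a - 1 means (Fisher-Hayter)."""
    return q_crit * sqrt(ms_error / n)


def critical_difference_separate(q_crit, var1, n1, var2, n2):
    """Formula 6.8: critical difference using the two groups' own variances."""
    return q_crit * sqrt(0.5 * (var1 / n1 + var2 / n2))


ms_error, n = 150.458, 4  # K&W51; n = 4 is an inferred assumption
print("n-tilde for n = 10 and n = 20:", round(harmonic_n(10, 20), 2))
print("D_HSD:", round(critical_difference(3.62, ms_error, n), 2))
print("D_FH:", round(critical_difference(3.20, ms_error, n), 2))
```

When the two variances both equal MS_Error and the sample sizes are equal, Formula 6.8 collapses to Formula 6.7, which is a handy sanity check on either one.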

6.5 Post-Hoc Error Correction

You may be inclined to compute a whole host of comparisons, including some complex comparisons. If you're doing so in an exploratory fashion ("throw it against the wall and see what sticks"), you are asked to pay some penalty by using a very conservative test.

Scheffé's Procedure

The Scheffé test is the most conservative post hoc test. Basically, it controls for the FW error that would occur if you were to conduct every possible comparison. Ideally, a person would be conducting far fewer comparisons than that!

Compute the Scheffé test by first computing the F_Comp. In the presence of heterogeneity of variance, you would use separate variances for the denominator. In the presence of homogeneity of variance, you would use the pooled variance for the denominator. Then, test your F_Comp for significance by comparing it to the F_Crit Scheffé (F_S), where

$$F_{S} = (a - 1)\,F(df_{A}, df_{S/A}) \qquad (6.11)$$

Thus, for comparisons in K&W51, F_S = (3)(3.49) = 10.47. Basically, I think that you should avoid using this procedure.

Recommendations and Guidelines (from Keppel 3rd)

As in all things, controlling for inflated chance of familywise Type I error calls for moderation. The Sidák-Bonferroni procedure seems a reasonable approach for smaller sets of comparisons. Tukey's HSD (Tukey-Kramer) or the Fisher-Hayter procedure seem to be reasonable for simple pairwise comparisons.

Keep in mind that FW error rate may not be as serious as it might appear to be. As Keppel notes, assuming that H0 is true, if you replicated an experiment 2000 times and conducted the same 5 comparisons after each experiment, you would expect that a Type I error would occur in 500 experiments (.25 x 2000). However, in those 500 experiments, only 10% (50) would contain more than one Type I error in the 5 comparisons. Fewer still would have more than two. Furthermore, keep in mind that most experiments reflect some treatment effect (i.e., H0 is false). That is, you are rarely dealing with a situation in which the treatment has no effect.
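
Keppel's 2000-replication argument uses the quick cα estimate (.25). If you'd like to see how it holds up with exact probabilities, here is a minimal sketch. It assumes the 5 comparisons are independent and each tested at α = .05 (my assumption, not something stated above), and the function names are hypothetical.

```python
from math import comb


def prob_k_errors(k, c, alpha):
    """Binomial probability of exactly k Type I errors among c independent tests."""
    return comb(c, k) * alpha ** k * (1 - alpha) ** (c - k)


def expected_experiments(replications, c, alpha):
    """Expected number of replications with at least one, and with more than one,
    Type I error when every H0 is true."""
    p_none = prob_k_errors(0, c, alpha)
    p_one = prob_k_errors(1, c, alpha)
    at_least_one = replications * (1 - p_none)
    more_than_one = replications * (1 - p_none - p_one)
    return at_least_one, more_than_one


at_least_one, more_than_one = expected_experiments(2000, 5, 0.05)
print(f"At least one Type I error: about {at_least_one:.0f} experiments")
print(f"More than one Type I error: about {more_than_one:.0f} experiments")
print(f"Share with more than one: {more_than_one / at_least_one:.1%}")
```

The exact counts come out a little below 500 and 50, but the share of affected experiments with more than one error stays at roughly 10%, so the point of the argument is unchanged.
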
Keppel argues for the value of planned comparisons (as does G. Loftus, 1996). Although I am in conceptual agreement, I worry about the practicalities facing a researcher who chooses to report planned comparisons. A journal editor may be sympathetic, but I worry about what an unsympathetic editor/reviewer might say about planned comparisons.

Keppel does suggest that replications are important, especially as a means of offsetting a perceived inflated Type I error rate. Once again, in an ideal world I'd agree. However, an untenured researcher would probably benefit from doing publishable research, and journals are not yet willing to publish replications.

We must recognize that the decision to correct for an inflated FW Type I error rate is a decision to increase the chance of making a Type II error (i.e., a decrease in power). Thus, if you have a set of planned comparisons in mind, you might well estimate your needed sample size on the basis of the planned comparisons, rather than on the basis of the overall ANOVA.

Using SPSS for Comparisons

If you choose to use Analyze->Compare Means->One-way ANOVA for analysis, you'll first see the window below left. If you click on the Post Hoc button, you'll see the window below right.

[Screenshots: SPSS One-Way ANOVA dialog (left) and Post Hoc Multiple Comparisons dialog (right)]

As you can see, many of the procedures that K&W describe are available in SPSS. However, the Fisher-Hayter procedure isn't one of the options. Thus, if you choose to use this very reasonable post hoc procedure, you'll need to do so outside of SPSS. Looking at the similar Tukey's HSD for the KW51 data set, you'd see the output below.

Multiple Comparisons
Dependent Variable: errors
Tukey HSD

                          Mean Difference                        95% Confidence Interval
(I) hrsdep   (J) hrsdep   (I-J)             Std. Error   Sig.    Lower Bound   Upper Bound
1            2            -11.250           8.673        .58     -37.00         14.50
             3            -31.000*          8.673        .017    -56.75         -5.25
             4            -35.250*          8.673        .007    -61.00         -9.50
2            1             11.250           8.673        .58     -14.50         37.00
             3            -19.750           8.673        .158    -45.50          6.00
             4            -24.000           8.673        .071    -49.75          1.75
3            1             31.000*          8.673        .017      5.25         56.75
             2             19.750           8.673        .158     -6.00         45.50
             4             -4.250           8.673        .960    -30.00         21.50
4            1             35.250*          8.673        .007      9.50         61.00
             2             24.000           8.673        .071     -1.75         49.75
             3              4.250           8.673        .960    -21.50         30.00

*. The mean difference is significant at the 0.05 level.