Beyond the F Test: Effect Size Confidence Intervals and Tests of Close Fit in the Analysis of Variance and Contrast Analysis

Psychological Methods, 2004, Vol. 9, No. 2. Copyright 2004 by the American Psychological Association.

James H. Steiger
Vanderbilt University

This article presents confidence interval methods for improving on the standard F tests in the balanced, completely between-subjects, fixed-effects analysis of variance. Exact confidence intervals for omnibus effect size measures, such as $\omega^2$ and the root-mean-square standardized effect, provide all the information in the traditional hypothesis test and more. They allow one to test simultaneously whether overall effects are (a) zero (the traditional test), (b) trivial (do not exceed some small value), or (c) nontrivial (definitely exceed some minimal level). For situations in which single-degree-of-freedom contrasts are of primary interest, exact confidence interval methods for contrast effect size measures such as the contrast correlation are also provided.

The analysis of variance (ANOVA) remains one of the most commonly used methods of statistical analysis in the behavioral sciences. Most ANOVAs, especially in exploratory studies, report an omnibus F test of the hypothesis that a main effect, interaction, or simple main effect is precisely zero. In recent years, a number of authors (Cohen, 1994; Rosnow & Rosenthal, 1996; Schmidt, 1996; Schmidt & Hunter, 1997; Serlin & Lapsley, 1993; Steiger & Fouladi, 1997) have sharply questioned the efficacy of tests of this nil hypothesis. Several of these critiques have concentrated on ways that the nil hypothesis test fails to deliver the information that the typical behavioral scientist wants. However, a number of the articles have also suggested, more or less specifically, replacements for or extensions of the null hypothesis test that would deliver much more useful information.
The suggestions have developed along several closely related lines, including the following:

1. Eliminate the emphasis on omnibus tests, with attention instead on focused contrasts that answer specific research questions, along with calculation of point estimates and approximate confidence interval estimates for some correlational measures of effect size (e.g., Rosenthal, Rosnow, & Rubin, 2000; Rosnow & Rosenthal, 1996).
2. Calculate exact confidence interval estimates of measures of standardized effect size, using an iterative procedure (e.g., Smithson, 2001; Steiger & Fouladi, 1997).
3. Perform tests of a statistical null hypothesis other than that of no difference or zero effect (e.g., Serlin & Lapsley, 1993).

Author note: I express my gratitude to Michael W. Browne, Stanley A. Mulaik, the late Jacob Cohen, William W. Rozeboom, Rachel T. Fouladi, Gary H. McClelland, and numerous others who have encouraged this project. Correspondence concerning this article should be addressed to James H. Steiger, Department of Psychology and Human Development, Box 51 Peabody College, Vanderbilt University, Nashville, TN. E-mail: jsteiger@bellsouth.net

As proponents of the first suggestion, Rosnow and Rosenthal (1996) discussed several types of correlation coefficients that are useful in assessing experimental effects. Their work is particularly valuable in situations in which the researcher has questions that are best addressed by testing single contrasts. Rosnow and Rosenthal emphasized the use of the Pearson correlation, rather than the squared multiple correlation, partly because of concern that the latter tends to present an overly pessimistic picture of the value of small experimental effects.

The second suggestion, exact interval estimation, has been gathering momentum since around [...]. The movement to replace hypothesis tests with confidence intervals stems from the fundamental realization that, in many if not most situations, confidence intervals provide more of the information that the scientist is truly interested in.
For example, in a two-group experiment, the scientist is more interested in knowing how large the difference between the two groups is (and how precisely it has been determined) than whether the difference between the groups is exactly zero.

The third suggestion, which might be called tests of close fit, has much in common with the approach widely known to biostatisticians as bioequivalence testing and is based on the idea that the scientist should not be testing perfect adherence to a point hypothesis but should replace the test of perfect fit with a relaxed test of a more appropriate hypothesis. Tests of close fit share many of their computational aspects with the exact interval estimation approach in terms of the software routines required to compute probability levels, power, and sample size. They remain within the familiar hypothesis-testing framework, while providing important practical and conceptual gains, especially when the experimenter's goal is to demonstrate that an effect is trivial.

In this article, I present methods that implement; support; and, in some cases, unify and extend major suggestions (1) through (3) discussed above. First I briefly review the history, rationale, and theory behind exact confidence intervals on measures of standardized effect size in ANOVA. I then provide detailed instructions, with examples, and software support for computing these confidence intervals. Next I discuss a general procedure for assessing effects that are represented by one or more contrasts, using correlations. Included is a population rationale, with sampling theory and an exact confidence interval estimation procedure, for one of the correlational measures discussed by Rosnow and Rosenthal (1996). Although the initial emphasis is on confidence interval estimation, I also discuss how the same technology that generates confidence intervals may be used to test hypotheses of minimal effect, thus implementing the good-enough principle discussed by Serlin and Lapsley (1993).

Exact Confidence Intervals on Standardized Effect Size

The notion that hypothesis tests of zero effect should be replaced with exact confidence intervals on measures of effect size has been around for quite some time but was somewhat impractical because of its computational demands until about 10 years ago. A general method for constructing the confidence intervals, which Steiger and Fouladi (1997) referred to as noncentrality interval estimation, is considered elementary by statisticians but seldom is discussed in behavioral statistics texts. In this section, I review some history, then describe the method of noncentrality interval estimation in detail.

Rationale and History

Suppose that, as a researcher, you test a drug that you believe enhances performance. You perform a simple two-group experiment with a double-blind control. In this case, you are engaging in reject-support (R-S) hypothesis testing (rejecting the null hypothesis will support your belief). The null and alternative hypotheses might be

$$H_0\colon \mu_1 \le \mu_2; \qquad H_1\colon \mu_1 > \mu_2. \tag{1}$$

The null hypothesis states that the drug is no better than a placebo. The alternative, which the investigator believes, is that the drug enhances performance. Rejecting the null hypothesis, even at a very low alpha such as .001, need not indicate that the drug has a strong effect, because if sample size is very large relative to the sampling variability of the drug effect, even a trivial effect might be declared highly significant. On the other hand, if sample size is too low, even a strong effect might have a low probability of creating a statistically significant result.

Statistical power analysis (Cohen, 1988) and sample size estimation have been based on the notion that calculations made before data are gathered can help to create a situation in which neither of the above problems is likely to occur. That is, sample size is chosen so that power will be high, but not too high.
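
The two problems just described can be made concrete with a small numerical sketch. It is not part of the original article. It assumes a one-sided two-sample z test with known, equal variances (a large-sample simplification), and the function names `one_sided_p` and `power` are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def one_sided_p(d_obs, n_per_group):
    """One-sided p value when the observed standardized mean difference is d_obs.

    Assumes a two-sample z test with known, equal variances, so the test
    statistic is z = d_obs * sqrt(n / 2).
    """
    z = d_obs * math.sqrt(n_per_group / 2.0)
    return 1.0 - STD_NORMAL.cdf(z)


def power(d_true, n_per_group, alpha=0.05):
    """Probability of rejecting H0 when the true standardized difference is d_true."""
    crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    shift = d_true * math.sqrt(n_per_group / 2.0)
    return 1.0 - STD_NORMAL.cdf(crit - shift)


# A trivial effect (d = 0.05) with a very large sample, and a strong
# effect (d = 0.80) with a very small one.
print(one_sided_p(0.05, 20000), power(0.05, 20000))
print(power(0.80, 10))
```

Under these assumptions, a difference of one twentieth of a standard deviation is "highly significant" with 20,000 observations per group, whereas a difference of 0.8 standard deviations is detected only a little more than half the time with 10 per group.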

There is an alternative situation, accept-support (A-S) testing, that attracts far less attention than R-S testing in statistics texts and has had far less impact on the popular wisdom of hypothesis testing. In A-S testing, the statistical null hypothesis is what the experimenter actually wishes to prove. Accepting the statistical null hypothesis supports the researcher's theory. Suppose, for example, an experiment provides convincing evidence that the above-mentioned drug actually works. The next step might be to provide convincing evidence that it has few, or acceptably low, side effects. In this case two groups are studied, and some measure of side effects is taken. The null hypothesis is that the experimental group's level of side effects is less than or equal to the control group's level. The researcher (or drug company) supporting the research wants not to reject this null hypothesis, because in this case accepting the null hypothesis supports the researcher's point of view, that is, that the drug is no more harmful than its predecessors.

In a similar vein, a company might wish to show that a generic drug does not differ appreciably in bioavailability from its brand name equivalent. This problem of bioequivalence testing is well known to biostatisticians and has resulted in a very substantial literature (e.g., Chow & Liu, 2000). Suppose that Drug A has a well-established bioavailability level $\mu_A$, and an investigator wishes to assess the bioequivalence of Drug B with Drug A. One might engage in A-S testing, that is, test the null hypothesis that

$$\mu_A = \mu_B \tag{2}$$

and declare the two drugs bioequivalent if this null hypothesis is not rejected. However, the perils of such A-S testing are even greater than in R-S testing. Specifically, simply running a sloppy, low-power experiment will tend to result in nonrejection of the null hypothesis, even if the drugs differ appreciably in bioavailability. Thus, paradoxically, someone trying to establish the bioequivalence of Drug B with Drug A could virtually guarantee success simply by using too small a sample size. Moreover, with extremely large sample sizes, Drug B might be declared nonequivalent to Drug A even if the difference between them is trivial.

Because of such problems, biostatisticians decided long ago that the test for strict equality is inappropriate for bioavailability studies (Metzler, 1974). Rather, a dual hypothesis test should be performed. Suppose that the Food and Drug Administration has determined that any drug with bioavailability within 20% of $\mu_A$ may be considered bioequivalent and prescribed in its stead. Suppose that $\theta_1$ and $\theta_2$ represent these bioequivalence limits. Then establishing bioequivalence of Drug B with Drug A might amount to rejecting the following hypothesis,

$$H_0\colon \mu_B \le \theta_1 \ \text{or} \ \mu_B \ge \theta_2, \tag{3}$$

against the alternative

$$H_a\colon \theta_1 < \mu_B < \theta_2. \tag{4}$$

In practice, this usually amounts to testing two one-sided hypotheses,

$$H_{01}\colon \mu_B \le \theta_1 \quad \text{versus} \quad H_{a1}\colon \mu_B > \theta_1 \tag{5}$$

and

$$H_{02}\colon \mu_B \ge \theta_2 \quad \text{versus} \quad H_{a2}\colon \mu_B < \theta_2. \tag{6}$$

An alternative approach (Westlake, 1976) is to construct a confidence interval for $\mu_B$. Bioequivalence would be declared if the confidence interval falls entirely within the established bioequivalence limits.

In other contexts, particularly the more exploratory studies performed in psychology, the research goal may be simply to pinpoint the nature of a parameter rather than to decide whether it is within a known fixed range. In that case, reporting the endpoints of a confidence interval (without announcing an associated decision) may be an appropriate conclusion to an analysis.
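
The confidence interval approach to bioequivalence described above can be illustrated with a short sketch. It is not taken from the article. It assumes a normally distributed estimate with a known standard error, and it uses a 100(1 - 2*alpha)% interval so that the decision matches two one-sided z tests at level alpha (a common convention, stated here as an assumption). The function name `equivalence_by_interval` and the numbers are hypothetical.

```python
from statistics import NormalDist


def equivalence_by_interval(estimate, std_error, lower_limit, upper_limit, alpha=0.05):
    """Declare equivalence if the confidence interval lies inside the limits.

    The interval is estimate +/- z_(1 - alpha) * std_error, that is, a
    100(1 - 2*alpha)% interval under a normal approximation (an assumption).
    Returns the interval and the decision.
    """
    crit = NormalDist().inv_cdf(1.0 - alpha)
    lo = estimate - crit * std_error
    hi = estimate + crit * std_error
    return (lo, hi), (lower_limit < lo and hi < upper_limit)


# Hypothetical numbers: Drug A has bioavailability 100, so limits within
# 20% of that value are 80 and 120.
precise = equivalence_by_interval(100.0, 2.0, 80.0, 120.0)
sloppy = equivalence_by_interval(100.0, 15.0, 80.0, 120.0)
print(precise)
print(sloppy)
```

Unlike the accept-support test of strict equality, a low-precision study cannot "succeed" here: a large standard error widens the interval beyond the limits, so equivalence is not declared.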

In any case, because the hypothesis test may be performed with the confidence interval, it seems that the confidence interval should always be reported. It contains all the information in a hypothesis test result, and more.

In structural equation modeling, which includes factor analysis and multiple regression as special cases, statistical testing prior to 1980 was limited to a chi-square test of perfect fit. In this procedure, the statistical null hypothesis is that the model fits perfectly in the population. This hypothesis test was performed, and a model was judged to fit the data sufficiently well if the null hypothesis was not rejected. There was widespread dissatisfaction with the test, because no model would be expected to fit perfectly, and so large sample sizes usually led to rejection of a model, even if it fit the data quite well. In this arrangement, enhanced precision actually worked against the researcher's interests.

Steiger and Lind (1980) suggested that the traditional null hypothesis test of perfect fit of a structural model be replaced by a confidence interval on the root-mean-square error of approximation (RMSEA), an index of population badness of fit that compensated for the complexity of the model. MacCallum, Browne, and Sugawara (1996) suggested augmenting the confidence interval with a pair of hypothesis tests. They considered a population RMSEA value of .05 to be indicative of a close-fitting model, whereas a value of .08 or more was evidence of marginal to poor fit. Consequently, a test of close fit would test the null hypothesis that the RMSEA is greater than or equal to .05 against the alternative that it is less than .05. Rejection of the null hypothesis indicates close fit. A test of not-close fit tests the null hypothesis that the RMSEA is less than or equal to .08 against the alternative that it is greater than .08. Rejection of the null hypothesis indicates that fit is not close. MacCallum et al. demonstrated in detail how, with such hypothesis tests, power calculations could be performed and required sample sizes estimated.

These two one-sided tests can be performed easily and simultaneously with a single $1 - 2\alpha$ confidence interval recommended by Steiger (1989). Simply construct the confidence interval and see whether its upper end is below .05 (in which case the test of close fit results in rejection at the alpha level) and whether its lower end exceeds .08 (in which case the test of not-close fit results in rejection at the alpha level). The confidence interval provides all the information in both hypothesis tests, and more.

Fleishman (1980) suggested interval estimation as a supplement for the F test in ANOVA. He gave examples of how to compute exact confidence intervals on a number of useful quantities, such as the signal-to-noise ratio, in ANOVA. These confidence intervals offered clear advantages over the traditional hypothesis test. Other authors have noted the existence of exact confidence intervals for the standardized effect size in the simplest special case of ANOVA, the two-sample t test (e.g., Hedges & Olkin, 1985).

The rationale for switching from hypothesis testing to confidence interval estimation is straightforward (Steiger & Fouladi, 1997). Unfortunately, the exact interval estimation procedures of Steiger and Lind (1980), Fleishman (1980), and Hedges and Olkin (1985) are virtually impossible to compute accurately by hand. However, by 1990, microcomputer capabilities had advanced substantially. The RMSEA confidence interval was implemented in general purpose structural equation modeling software (Mels, 1989; Steiger, 1989) and, by the late 1990s, had achieved widespread use. Steiger (1990) presented general procedures for constructing confidence intervals on measures of effect size in covariance structure analysis, ANOVA, contrast analysis, and multiple regression. Steiger and Fouladi (1992) produced a general computer program, R2, that performed exact confidence interval estimation of the squared multiple correlation in multiple regression. Taylor and Muller (1995, 1996) have presented general procedures for analyzing power and noncentrality in the general linear model, including an analysis of the impact of restriction of published articles to significant results. Steiger and Fouladi (1997) demonstrated general procedures for confidence interval calculations, and Steiger (1999) implemented these in a commercial software package. Smithson (2001) discussed a number of confidence interval procedures in fixed and random regression models and included SPSS macros for calculating confidence intervals for noncentral distributions. Reiser (2001) discussed confidence intervals on functions of Mahalanobis distance.

General Theory of Noncentrality-Based Interval Estimation

In this section, I review the general theoretical principles for constructing exact confidence intervals for effect size, power, and sample size in the balanced fixed-effects between-subjects ANOVA. For a more detailed discussion of these principles, see Steiger and Fouladi (1997). Throughout what follows, I adopt a simple notational device: When several groups or cells are sampled, I use $N_{\text{tot}}$ to stand for the total sample size and use $n$ to stand for the number of observations in each group. I begin this section with a brief nontechnical discussion of noncentral distributions.
The t, chi-square, and F distributions are special cases of more general distributions called the noncentral t, noncentral chi-square, and noncentral F. Each of these noncentral distributions has an additional parameter, called the noncentrality parameter. For example, whereas the F distribution has two parameters (the numerator and denominator degrees of freedom), the noncentral F has these two plus a noncentrality parameter (often indicated with the symbol $\lambda$). When the noncentral F distribution has a noncentrality parameter of zero, it is identical to the F distribution, so it includes the F distribution as a special case. Similar facts hold for the t and chi-square distributions.

What makes the noncentrality parameter especially important is that it is related very closely to the truth or falsity of the null hypotheses that these distributions are typically used to test. Thus, for example, when the null hypothesis of no difference between two means is correct, the standard t statistic has a distribution that has a noncentrality parameter of zero, whereas if the null hypothesis is false, it has a noncentral t distribution, that is, the noncentrality parameter is nonzero. The more false the null hypothesis, the larger the absolute value of the noncentrality parameter for a given alpha and sample size.

Most confidence intervals in introductory textbooks are derived by simple manipulation of a statement about interval probability of a sampling distribution. This approach cannot be used to generate exact confidence intervals for many quantities of fundamental importance in statistics. As an example, consider the sample squared multiple correlation, whose distribution changes as a function of the population squared multiple correlation.
Confidence intervals for the squared multiple correlation are very informative yet are not discussed in standard texts, because a single simple formula for the direct calculation of such an interval cannot be obtained in a manner analogous to the way one obtains a confidence interval for the population mean.

Steiger and Fouladi (1997) discussed a general method for confidence interval construction that handles many such interesting examples. The method combines two general principles, which they called the confidence interval transformation principle and the inversion confidence interval principle. The former is obvious but seldom discussed formally. The latter is referred to by a variety of names in textbooks and review articles (Casella & Berger, 2002; Steiger & Fouladi, 1997), yet it does not seem to have found its way into the standard behavioral statistics textbooks, primarily because its implementation involves some difficult computations. However, the method is easy to discuss in principle and is no longer impractical. When the two principles are combined, a number of very useful confidence intervals result.

Proposition 1: Confidence interval transformation principle. Let $f(\theta)$ be a monotone function of $\theta$, that is, a function whose slope never changes sign and is never zero. Let $l_1$ and $l_2$ be lower and upper endpoints of a $1 - \alpha$ confidence interval on quantity $\theta$. Then, if the function is increasing, $f(l_1)$ and $f(l_2)$ are lower and upper endpoints, respectively, of a $100(1 - \alpha)\%$ confidence interval on $f(\theta)$. If the function is decreasing, $f(l_2)$ and $f(l_1)$ are lower and upper endpoints.

Here are two elementary examples of this principle.

Example 1: Suppose you read in a textbook how to calculate a confidence interval for the population variance $\sigma^2$. However, you desire a confidence interval for $\sigma$. Because $\sigma$ takes on only nonnegative values, it is a monotonic increasing function of $\sigma^2$ over its domain.
Hence, the confidence interval for $\sigma$ is obtained by taking the square root of the endpoints for the corresponding confidence interval for $\sigma^2$.

Example 2: Suppose one calculates a confidence interval for $z(\rho)$, the Fisher transform of $\rho$, the population correlation coefficient. Taking the inverse Fisher transform of the endpoints of this interval will give a confidence interval for $\rho$.
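
Both examples can be sketched in a few lines. The code below is an illustration, not part of the article. The helper `transform_interval` is a hypothetical name, and `correlation_ci` assumes the usual large-sample approximation in which the Fisher transform of the sample correlation is normal with standard error $1/\sqrt{N - 3}$.

```python
import math
from statistics import NormalDist


def transform_interval(lower, upper, func, increasing=True):
    """Apply a monotone function to both endpoints of a confidence interval.

    For a decreasing function the transformed endpoints swap roles.
    """
    a, b = func(lower), func(upper)
    return (a, b) if increasing else (b, a)


def correlation_ci(r, n_obs, conf=0.95):
    """Interval for a correlation via the Fisher transform (Example 2).

    Assumes z(r) = atanh(r) is approximately normal with standard error
    1 / sqrt(n_obs - 3).
    """
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0) / math.sqrt(n_obs - 3)
    return transform_interval(z - half_width, z + half_width, math.tanh)


# Example 1: an interval of (4, 9) for the variance becomes (2, 3) for the
# standard deviation, because the square root is increasing.
print(transform_interval(4.0, 9.0, math.sqrt))
print(correlation_ci(0.5, 103))
```

The `increasing` flag covers the second half of Proposition 1: for a decreasing function such as the reciprocal, the transformed upper endpoint becomes the new lower endpoint.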

This is, in fact, the method used to calculate the standard confidence interval for a correlation.

These examples show why Proposition 1 is very useful in practice. A statistical quantity we are very interested in, such as $\rho$, may be a simple function of a quantity, such as $z(\rho)$, that we are not so interested in but for which we can easily obtain a confidence interval.

Next, we define the inversion confidence interval principle.

Proposition 2: Inversion confidence interval principle. Let $x$ be the observed value of $X$, a random variable with a continuous cdf (cumulative distribution function) $F(x, \theta) = \Pr(X \le x \mid \theta)$ for some numerical parameter $\theta$. Let $\alpha_1 + \alpha_2 = \alpha$ with $0 < \alpha < 1$ be fixed values. If $F(x, \theta)$ is strictly decreasing in $\theta$, for fixed values of $x$, choose $l_1(x)$ and $l_2(x)$ so that $\Pr[X \le x \mid \theta = l_1(x)] = 1 - \alpha_2$ and $\Pr[X \le x \mid \theta = l_2(x)] = \alpha_1$. If $F(x, \theta)$ is strictly increasing in $\theta$, for fixed values of $x$, choose $l_1(x)$ and $l_2(x)$ so that $\Pr[X \le x \mid \theta = l_1(x)] = \alpha_1$ and $\Pr[X \le x \mid \theta = l_2(x)] = 1 - \alpha_2$. Then the random interval $[l_1(x), l_2(x)]$ is a $100(1 - \alpha)\%$ confidence interval for $\theta$. Upper or lower $100(1 - \alpha)\%$ confidence bounds (or one-sided confidence intervals) may be obtained by setting $\alpha_1$ or $\alpha_2$ to zero.

For a simple graphically based explanation of Proposition 2, consult Steiger and Fouladi (1997). For a clear, succinct discussion with partial proof, see Casella and Berger (2002, p. 43), who referred to this as pivoting the cdf. In this article, I assume $\alpha_1 = \alpha_2 = \alpha/2$, although such an interval may not be the minimum width for a given $\alpha$.

Proposition 2 implies a simple approach to interval estimation: Suppose you have observed an F statistic with a value $x$ and known degrees of freedom $\nu_1$ and $\nu_2$. Denote the cumulative distribution of the F statistic by $F(x, \lambda)$, where $\lambda$ is the noncentrality parameter. It can be shown that if $\nu_1$, $\nu_2$, and $x$ are held constant at any positive value, then $F(x, \lambda)$ is strictly decreasing in $\lambda$. Accordingly, Proposition 2 can be used. To calculate a $100(1 - \alpha)\%$ confidence interval on the noncentrality parameter of the F distribution, use the following steps.

1. Calculate the cumulative probability $p$ of $x$ in the central F distribution. If $p$ is below $\alpha/2$, then both limits of the confidence interval are zero. If $p$ is below $1 - \alpha/2$, the lower limit of the confidence interval is zero, and the upper limit must be calculated (go to Step 3). Otherwise, calculate both limits of the confidence interval, using Steps 2 and 3.
2. To calculate the lower limit, find the unique value of $\lambda$ that places $x$ at the $1 - \alpha/2$ cumulative probability point of a noncentral F distribution with $\nu_1$ and $\nu_2$ degrees of freedom.
3. To calculate the upper limit, find the unique value of $\lambda$ that places $x$ at the $\alpha/2$ cumulative probability point of a noncentral F distribution with $\nu_1$ and $\nu_2$ degrees of freedom.

Calculating a confidence interval for $\lambda$ thus requires iterative calculation of the unique value of $\lambda$ that places an observed value of F at a particular percentile of the noncentral F distribution.¹

In what follows, I give a variety of examples of confidence interval calculations. Some will be at the 95% level of confidence, others at the less common 90% level. In a later section, I discuss why, when confidence intervals are used to perform a hypothesis test at the .05 level, a 90% interval may be appropriate in some situations and a 95% interval in others. At that point, I describe how to select confidence intervals at the appropriate level to perform a particular hypothesis test.

Measures of Standardized Effect Size

Now I examine some more ambitious examples. For simplicity of exposition, I assume in this section that either the freeware program NDC (noncentral distribution calculator; see Footnote 1) or other software is available to compute a confidence interval on $\lambda$, the noncentrality parameter of a noncentral F distribution.

Consider the one-way, fixed-effects ANOVA, in which $p$ means are compared for equality, and there are $n$ observations per group. The overall F statistic has a distribution that is a noncentral F, with degrees of freedom $\nu_1 = p - 1$ and $\nu_2 = p(n - 1) = N_{\text{tot}} - p$. The noncentrality parameter $\lambda$ can be expressed in a number of ways.
One formula that appears frequently in textbooks is

$$\lambda = n \sum_{j=1}^{p} \left( \frac{\alpha_j}{\sigma} \right)^2. \tag{7}$$

The $\alpha_j$ values in Equation 7 are the effects as commonly defined in ANOVA, that is,

$$\alpha_j = \mu_j - \mu. \tag{8}$$

If $\mu_j$ is the mean of the jth group, and $\mu$ is the overall mean, then $\mu$ is, in the case of equal $n$, simply the arithmetic average of the $\mu_j$. More generally (although in what follows I assume a balanced design unless stated otherwise),

$$\mu = \sum_{j=1}^{p} \frac{n_j}{N_{\text{tot}}} \mu_j. \tag{9}$$

¹ NDC (noncentral distribution calculator), a freeware Windows program for calculating percentage points and noncentrality confidence intervals for noncentral F, t, and chi-square distributions, is available for direct download from the author's website (http://
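
For readers without access to a program such as NDC, the three-step iterative calculation of the noncentrality interval can be sketched in plain Python. This is an illustrative sketch, not the algorithm used by NDC. It assumes the standard representation of the noncentral F cdf as a Poisson-weighted sum of regularized incomplete beta functions, and it solves for $\lambda$ by bisection. All function names (`_beta_cf`, `betainc`, `ncf_cdf`, `noncentrality_ci`) are hypothetical.

```python
import math


def _beta_cf(a, b, x):
    """Continued-fraction evaluation of I_x(a, b); accurate for x below (a+1)/(a+b+2)."""
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 1000):
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for num in (even, odd):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < 1e-14:
            break
    return math.exp(log_front) * h / a


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x > (a + 1.0) / (a + b + 2.0):
        return 1.0 - _beta_cf(b, a, 1.0 - x)  # symmetry relation
    return _beta_cf(a, b, x)


def ncf_cdf(x, df1, df2, lam):
    """Noncentral F cdf as a Poisson(lam / 2) mixture of incomplete beta terms."""
    if x <= 0.0:
        return 0.0
    y = df1 * x / (df1 * x + df2)
    if lam <= 0.0:
        return betainc(df1 / 2.0, df2 / 2.0, y)  # central F
    half = lam / 2.0
    k_max = int(half + 10.0 * math.sqrt(half) + 30.0)  # remaining tail is negligible
    total = 0.0
    for k in range(k_max + 1):
        log_w = -half + k * math.log(half) - math.lgamma(k + 1.0)
        total += math.exp(log_w) * betainc(df1 / 2.0 + k, df2 / 2.0, y)
    return total


def noncentrality_ci(f_obs, df1, df2, conf=0.95):
    """Steps 1 to 3: interval for lambda by inverting the noncentral F cdf."""
    alpha = 1.0 - conf
    p_central = ncf_cdf(f_obs, df1, df2, 0.0)  # Step 1

    def solve(target):
        lo, hi = 0.0, 1.0
        while ncf_cdf(f_obs, df1, df2, hi) > target:  # cdf decreases in lambda
            hi *= 2.0
        for _ in range(50):  # bisection
            mid = 0.5 * (lo + hi)
            if ncf_cdf(f_obs, df1, df2, mid) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    lower = solve(1.0 - alpha / 2.0) if p_central > 1.0 - alpha / 2.0 else 0.0  # Step 2
    upper = solve(alpha / 2.0) if p_central > alpha / 2.0 else 0.0  # Step 3
    return lower, upper


# An F of 5.00 with 3 and 76 degrees of freedom (four groups of 20).
lam_lo, lam_hi = noncentrality_ci(5.00, 3, 76, conf=0.95)
print(lam_lo, lam_hi)
```

Bisection is used here only because it is simple and cannot fail when the cdf is strictly decreasing in $\lambda$. A production routine would more likely use a faster root finder and a library implementation of the noncentral F distribution.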

The quantity $\alpha_j/\sigma$ is a standardized effect, that is, the effect expressed in standard deviation units. The quantity $\lambda/n$ is therefore the sum of squared standardized effects.

There are numerous ways one might convert the sum of squared standardized effects into an overall measure of effect size. For example, suppose we average these squared standardized effects in order to obtain an overall measure of strength of effects in the design. The arithmetic average of the $p$ squared standardized effects, sometimes called the signal-to-noise ratio (Fleishman, 1980), is as follows:

$$f^2 = \frac{1}{p} \sum_{j=1}^{p} \left( \frac{\alpha_j}{\sigma} \right)^2 = \frac{\lambda}{np} = \frac{\lambda}{N_{\text{tot}}}. \tag{10}$$

One problem with this measure is that it is the average squared effect and so is not in the proper unit of measurement. A potential solution is to simply take the square root of the signal-to-noise ratio, obtaining

$$f = \sqrt{\frac{\lambda}{N_{\text{tot}}}} = \sqrt{\frac{1}{p} \sum_{j=1}^{p} \left( \frac{\alpha_j}{\sigma} \right)^2}. \tag{11}$$

In a one-way ANOVA with $p$ groups and equal $n$, the effects are constrained to sum to zero, so there are actually only $p - 1$ independent effects. Thus, an alternative measure, $\lambda/[(p - 1)n]$, is the average squared independent standardized effect, and the root-mean-square standardized effect (RMSSE) is as follows:

$$\Psi = \sqrt{\frac{\lambda}{(p - 1)n}} = \sqrt{\frac{1}{p - 1} \sum_{j=1}^{p} \left( \frac{\alpha_j}{\sigma} \right)^2}. \tag{12}$$

Equations 11 and 12 demonstrate that the relationships between $\Psi$, $f$, and the noncentrality parameter $\lambda$ are straightforward. In order to obtain a confidence interval for $\Psi$, we proceed as follows. First, we obtain a confidence interval estimate for $\lambda$. Next, we invoke the confidence interval transformation principle to directly transform the endpoints by dividing by $(p - 1)n$. Finally, we take the square root. The result is an exact confidence interval on $\Psi$.

Example 3: Suppose a one-way fixed-effects ANOVA is performed on four groups, each with a sample size of 20, and that an overall F statistic of 5.00 is obtained, with 3 and 76 degrees of freedom, with a probability level of .003. The F test is thus highly significant, and the null hypothesis is rejected at the .01 level.
Some investigators might interpret this result as implying that a powerful experimental effect was found and that this was determined with high precision. In this case, the noncentrality interval estimate provides a more informative and somewhat different account of what has been found. The 95% confidence interval for $\lambda$ ranges from [...] to [...]. To convert this to a confidence interval for $\Psi$, we use Equation 12. The corresponding confidence interval for $\Psi$ ranges from .1764 to [...]. Effects are almost certainly here, but they are on the order of half a standard deviation, what is commonly considered a medium-size effect. Moreover, the size of the effects has not been determined with high precision.

Example 4: Fleishman (1980) described the calculation of confidence intervals on the noncentrality parameter of the noncentral F distribution to obtain, in a manner equivalent to that used in the previous two examples, confidence intervals on $f^2$ and $\omega^2$, the latter of which is defined as

$$\omega^2_{A(\text{partialed})} = \frac{\sigma^2_{S_A}}{\sigma^2_{S_A} + \sigma^2_e}, \tag{13}$$

where $\sigma^2_{S_A}$ is the variance of means for the levels of a particular effect A, that is,

$$\sigma^2_{S_A} = \frac{1}{p} \sum_{j=1}^{p} \alpha_j^2, \tag{14}$$

and $\sigma^2_e$ is the within-cell variance. $\omega^2_{A(\text{partialed})}$ may be thought of as the proportion of the variance remaining (after all other main effects and interactions have been partialed out) that is explained by the effect. (In what follows, for simplicity, I refer to the coefficient simply as $\omega^2$.) There are simple relationships between $f^2$, $\lambda$, and $\omega^2$, specifically,

$$f^2 = \frac{\omega^2}{1 - \omega^2} = \frac{\lambda}{np} = \frac{\lambda}{N_{\text{tot}}} \tag{15}$$

and

$$\omega^2 = \frac{f^2}{1 + f^2} = \frac{\lambda}{\lambda + N_{\text{tot}}}. \tag{16}$$

Fleishman (1980) cited an example given by Venables (1975) of a five-group ANOVA with $n = 11$ per cell and an observed F of [...]. In this case the 90% confidence interval for the noncentrality parameter $\lambda$ has endpoints [...] and [...]. Once we obtain the confidence interval for $\lambda$, it is a trivial matter to transform the limits of the interval to confidence limits for $\omega^2$, using Equation 16. For example, the lower limit becomes

$$\omega^2_{\text{lower}} = \frac{\lambda_{\text{lower}}}{\lambda_{\text{lower}} + N_{\text{tot}}} = [\ldots]. \tag{17}$$

In a similar manner, the upper limit of the confidence interval can be calculated as .565.
The confidence interval has determined with 90% confidence that the main effect accounts for between 26.1% and 56.5% of the variance in the dependent variable.

General Procedures for Effect Size Intervals in Between-Subjects Factorial ANOVA

In a previous example, we saw how easy it is to construct a confidence interval on measures of effect size in one-way ANOVA, provided a confidence interval for $\lambda$ has been computed. In this section, a completely general method is demonstrated for computing confidence intervals for various measures of standardized effect size in completely between-subjects factorial ANOVA designs with equal sample size $n$ per cell.

We begin with a general formula relating the noncentrality parameter $\lambda$ with the RMSSE in any completely between-subjects factorial ANOVA. Let $\theta$ stand for a particular effect, and $n$ the sample size per cell. Then

$$\Psi_\theta = \sqrt{\frac{\lambda_\theta}{n_\theta \, df_\theta}}. \tag{18}$$

In Equation 18, $n_\theta$ is equal to $n$ (the number of observations in each cell of the design) multiplied by the product of the numbers of levels in all the factors not represented in the effect currently under consideration; $df_\theta$ is the numerator degrees of freedom parameter for the effect under consideration.

There are simple relationships between the RMSSE and other measures of standardized effect size. Specifically, for a general factorial ANOVA,

$$f_\theta = \Psi_\theta \sqrt{\frac{df_\theta}{\text{Cells}_\theta}} = \sqrt{\frac{\lambda_\theta}{N_{\text{tot}}}}, \tag{19}$$

where $\text{Cells}_\theta$ is, for any main effect, the number of levels of the effect. For any interaction, it is the product of the numbers of levels for all factors involved in the interaction.

Table 1
Key Quantities for Computing Effect Size Intervals in Four-Way Analysis of Variance

Source   Levels   df_θ                                n_θ
A        p        p - 1                               nqrs
B        q        q - 1                               nprs
C        r        r - 1                               npqs
D        s        s - 1                               npqr
AB                (p - 1)(q - 1)                      nrs
AC                (p - 1)(r - 1)                      nqs
AD                (p - 1)(s - 1)                      nqr
BC                (q - 1)(r - 1)                      nps
BD                (q - 1)(s - 1)                      npr
CD                (r - 1)(s - 1)                      npq
ABC               (p - 1)(q - 1)(r - 1)               ns
ABD               (p - 1)(q - 1)(s - 1)               nr
ACD               (p - 1)(r - 1)(s - 1)               nq
BCD               (q - 1)(r - 1)(s - 1)               np
ABCD              (p - 1)(q - 1)(r - 1)(s - 1)        n
Error             pqrs(n - 1)

Note. θ represents a particular effect; n represents the sample size per cell; and p, q, r, and s represent levels of factors A, B, C, and D, respectively.
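
The bookkeeping behind Equations 16, 18, and 19 can be sketched as a pair of small functions. This is an illustration only. The names `effect_quantities` and `effect_sizes`, and the dictionary representation of the design, are hypothetical and not from the article.

```python
import math


def effect_quantities(levels, effect, n):
    """Key quantities for one effect in a balanced between-subjects design.

    levels: dict mapping each factor name to its number of levels
    effect: tuple of the factor names involved in the effect, e.g. ("A", "B")
    n:      observations per cell
    Returns (df, n_theta, cells, n_tot).
    """
    inside = [levels[f] for f in levels if f in effect]
    outside = [levels[f] for f in levels if f not in effect]
    df = math.prod(k - 1 for k in inside)    # numerator degrees of freedom
    n_theta = n * math.prod(outside)         # n times levels of the other factors
    cells = math.prod(inside)                # levels (or product of levels) in the effect
    n_tot = n * math.prod(levels.values())   # total sample size
    return df, n_theta, cells, n_tot


def effect_sizes(lam, df, n_theta, n_tot):
    """Convert a noncentrality value into the RMSSE, f, and omega squared."""
    psi = math.sqrt(lam / (n_theta * df))    # Equation 18
    f = math.sqrt(lam / n_tot)               # Equation 19
    omega_sq = lam / (lam + n_tot)           # Equation 16
    return psi, f, omega_sq


# A main effect in a 2 x 3 x 7 design with 6 observations per cell.
print(effect_quantities({"A": 2, "B": 3, "C": 7}, ("A",), 6))
# AB interaction in a 2 x 7 design with 4 observations per cell.
print(effect_quantities({"A": 2, "B": 7}, ("A", "B"), 4))
```

Because each conversion is monotone in $\lambda$, applying `effect_sizes` separately to the lower and upper limits of a noncentrality interval yields the corresponding limits for each effect size measure, as the transformation principle requires.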
The relationshi between f and is given in Equation 16. Some examles of these quantities, for a four-way ANOVA, with, q, r, and s levels of factors A, B, C, and D, resectively, are given in Table 1. The table may be used also for one-, two-, or three-way ANOVAs simly by eliminating terms involving levels not reresented in the design. For examle, in a three-way ANOVA, the BC interaction effect has (q 1)(r 1) numerator degrees of freedom, and n BC is n, because there is no s in this design. The error degrees of freedom in a three-way ANOVA are qr(n 1). In the following two examles, I demonstrate how to comute a 90% confidence interval on various measures of effect, using the information in the table. Examle 5: Suose that, as a researcher, you erform a three-way 3 7 ANOVA, with n 6 observations er cell. In this case, we have, q 3, and r 7. Suose that, for the A main effect, you observe an F statistic of 4.708, which, with 1 and 10 degrees of freedom, has We first calculate a confidence interval for. The endoints of this interval are lower and uer To convert these to confidence intervals on, f, f, and, we aly Equations 18, 19, and 16. For the A effect, we have n A (6)(3)(7) 16, df A ( 1) 1, Cells A, and N tot 5. Hence, for we have, from Equation 18, lower , uer For f and f we have, for the lower limits, f lower (0) , f 5 lower (1) For the uer limits, we obtain f uer and f uer We can also convert the confidence limits for f into limits for, using Equation 16. We have lower f lower () 1 f lower In a similar manner, we obtain the uer limit as uer Examle 6: Table 1 can also be used for a two-way ANOVA, simly by letting r 1 and s 1 and ignoring all

effects involving factors C and D. Suppose, for example, one were to perform a two-way 2 × 7 ANOVA, with n = 4 observations per cell, and the F statistic for the AB interaction is observed to be .50. The key quantities are $df_{AB}$ = 6, $df_{error}$ = 42, $n_{AB}$ = 4, and $\mathrm{Cells}_{AB}$ = 14. The confidence limits for $\lambda_{AB}$ are $\lambda_{lower}$ = … and $\lambda_{upper}$ = …. Consequently, from Equation 18, the confidence limits for the RMSSE are

$\Psi_{lower} = \sqrt{\lambda_{lower} / (n_{AB} \, df_{AB})} = \sqrt{\lambda_{lower} / ((4)(6))}$, (23)

$\Psi_{upper} = \sqrt{\lambda_{upper} / ((4)(6))}$. (24)

The confidence intervals for f and $f^2$ are

$f_{lower} = \sqrt{\lambda_{lower} / 56}$, $f^2_{lower} = \lambda_{lower} / 56$, (25)

$f_{upper} = \sqrt{\lambda_{upper} / 56}$, $f^2_{upper} = \lambda_{upper} / 56$. (26)

Using Equation 16, we convert the above to the following confidence limits for $\eta^2$:

$\eta^2_{lower} = f^2_{lower} / (1 + f^2_{lower})$, (27)

$\eta^2_{upper} = f^2_{upper} / (1 + f^2_{upper})$. (28)

Multiple Regression With Fixed Regressors

One standardized index of the size of effects is the squared multiple correlation coefficient between the independent variable and the scores on the dependent variable. This index, in the population, characterizes the strength of the effect. ANOVA may be conceptualized as a linear regression model with fixed independent variables. In this case, the theory of multiple regression with fixed regressors applies. It is important to realize (e.g., Sampson, 1974) that the theory for fixed regressors, although it shares many similarities with that for random regressors, has important differences, which are especially apparent when considering the nonnull distributions of the variables. The general model is

$E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$, (29)

where $\mathbf{y}$ is an $N_{tot} \times 1$ random vector, $\mathbf{X}$ is an $N_{tot} \times p$ matrix, and $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters. This model includes model errors ($\boldsymbol{\epsilon}$) that are assumed to be independently and identically distributed with a normal distribution, zero mean, and variance $\sigma^2$. That is,

$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, (30)

and $\boldsymbol{\epsilon}$ has a multivariate normal distribution with zero mean vector $\mathbf{0}$ and covariance matrix $\sigma^2 \mathbf{I}$, with $\mathbf{I}$ an identity matrix. It is common to partition $\boldsymbol{\beta}$ into

$\boldsymbol{\beta}' = [\beta_0 \;\; \boldsymbol{\beta}_1']$, (31)

where $\beta_0$ is an intercept term.
Correspondingly, $\mathbf{X}$ is partitioned as

$\mathbf{X} = [\mathbf{1} \;\; \mathbf{X}_1]$, (32)

where $\mathbf{1}$ is a column of ones and $\mathbf{X}_1$ contains the original X scores transformed into deviations about their sample means. Consider now a set of observed scores y, representing realizations of the random variables in $\mathbf{y}$. If $\mathbf{X}_1$ has p - 1 columns, then an F statistic for testing the hypothesis that $\boldsymbol{\beta}_1 = \mathbf{0}$ is

$F = \dfrac{R^2 / (p - 1)}{(1 - R^2) / (N_{tot} - p)}$. (33)

This statistic has a noncentral F distribution with p - 1 and $N_{tot} - p$ degrees of freedom, with a noncentrality parameter given by

$\lambda = \boldsymbol{\beta}'\mathbf{X}'(\mathbf{I} - \mathbf{P}_{\mathbf{1}})\mathbf{X}\boldsymbol{\beta} / \sigma^2 = \boldsymbol{\beta}'\mathbf{X}'\mathbf{Q}_{\mathbf{1}}\mathbf{X}\boldsymbol{\beta} / \sigma^2$. (34)

For any matrix $\mathbf{A}$ of full column rank, $\mathbf{P}_{\mathbf{A}}$ is the column space projection operator $\mathbf{A}(\mathbf{A}'\mathbf{A})^{-1}\mathbf{A}'$ and $\mathbf{Q}_{\mathbf{A}}$ the complementary projector $\mathbf{I} - \mathbf{P}_{\mathbf{A}}$.

We now turn to an application of this theory in the context of ANOVA. Consider the simple case of a one-way fixed-effects ANOVA with n observations in each of p independent groups. It is well known that this model can be written in the form of Equation 29, where $\mathbf{X}$ is a design matrix with $N_{tot} = np$ rows and p columns, and $\boldsymbol{\beta}$ contains p ANOVA parameters. We are not interested in $R^2$ per se. Rather, we are interested in the corresponding quantity in an infinite population of observations in which treatment groups are represented equally. There are several alternative ways of conceptualizing such a quantity. Formally, we can define $\rho^2$ as the probability limit of $R^2$, that is,

$\rho^2 = \operatorname{plim}_{n \to \infty} R^2$. (35)

This is the constant that $R^2$ converges to as the sample size increases without bound. It can be proven (see Appendix A) that, with this definition of $\rho^2$, the noncentrality parameter $\lambda$ is equivalent to

$\lambda = N_{tot} \, \dfrac{\rho^2}{1 - \rho^2}$, (36)

and so

$\rho^2 = \dfrac{\lambda}{\lambda + N_{tot}}$. (37)

Consequently, a confidence interval for $\lambda$ may be converted easily into a confidence interval on $\rho^2$ or $\rho$, because $\rho$ is nonnegative. $\rho^2$ represents the coefficient of determination for predicting scores on the dependent variable from only a knowledge of the population means of the groups in an infinite population in which all treatment groups are equally represented.

Example 7: Suppose that $\mathbf{X}$ is set up as in Equation 38 to represent a full rank design matrix for a one-way ANOVA, with three groups, and n = 3, and that the scores in y are 1, 2, 3, 4, 5, 6, 7, 8, 9. In this parameterization, $\beta_0$ corresponds to $\mu_3$, $\beta_1$ corresponds to $\mu_1 - \mu_3$, and $\beta_2$ corresponds to $\mu_2 - \mu_3$. The group means are 2, 5, 8, and the group variances are all 1.

$\mathbf{y} = \begin{bmatrix} y_{11} \\ y_{21} \\ y_{31} \\ y_{12} \\ y_{22} \\ y_{32} \\ y_{13} \\ y_{23} \\ y_{33} \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}$ (38)

In this case, it is easy to show using any standard multiple regression program that the sample squared multiple correlation for predicting y from X is .90 and that the F statistic for testing the null hypothesis that $\beta_1 = \beta_2 = 0$ is

$F_{2,6} = \dfrac{R^2 / 2}{(1 - R^2) / 6} = \dfrac{.9 / 2}{.1 / 6} = 27.0$. (39)

This F statistic is identical to the one obtained by performing a one-way fixed-effects ANOVA on the data. The 90% confidence interval for $\lambda$ has endpoints of $\lambda_{lower}$ = … and $\lambda_{upper}$ = …. The lower endpoint for the confidence interval on $\rho^2$, the coefficient of determination, is thus

$\rho^2_{lower} = \dfrac{\lambda_{lower}}{\lambda_{lower} + 9}$, (40)

and the upper endpoint is

$\rho^2_{upper} = \dfrac{\lambda_{upper}}{\lambda_{upper} + 9}$. (41)

With one-way ANOVA and equal n per group, this confidence interval is identical to the one for $\eta^2$ discussed earlier. Note also that the sample $R^2$ is positively biased with small sample sizes and will consequently be much closer to the upper end of the confidence interval than the lower.

One of several alternative methods for parameterizing the linear model in Equation 29 is to use what is sometimes called effect coding. In this case, the entries in $\mathbf{X}$ correspond to the contrast weights applied to group means in the ANOVA null hypothesis. For example, the hypothesis of no treatment effects in a one-way ANOVA with three groups corresponds to two contrasts simultaneously being zero, that is, $\mu_1 - \mu_3 = 0$ and $\mu_2 - \mu_3 = 0$.
The contrast weights for the two hypotheses are thus 1, 0, -1 and 0, 1, -1. Thus, omnibus effect size in ANOVA can be expressed as the multiple correlation between a set of contrast weights and the dependent variable. There has been a fair amount of discussion in the applied literature (Ozer, 1985; Rosenthal, 1991; Steiger & Ward, 1987) about whether the coefficient of determination is overly pessimistic in describing the strength of effects. Those who prefer may convert a confidence interval on $\rho^2$ to a confidence interval on $\rho$ simply by taking the square root of the endpoints of the former.

Confidence Intervals on Single-Contrast Measures of Effect Size

Rosenthal et al. (2000) argued convincingly for the importance of replacing the omnibus hypothesis in ANOVA with hypotheses that focus on substantive research questions. Often such hypotheses involve single contrasts of the form $\psi = \sum_{j=1}^{p} c_j \mu_j$, with $c_j$ the contrast weights and the null hypothesis being that $\psi = 0$. Rosenthal et al. discussed several different correlational measures for assessing the status of hypotheses on a single contrast. In this section, I discuss methods for exact confidence interval estimation of measures of effect size for a single contrast, including the population equivalent of the correlation measure $r_{contrast}$ discussed by Rosenthal et al.

Exact Confidence Intervals for Standardized Contrast Effect Size

Consider a contrast hypothesis on p means, of the form

$H_0: \psi = \sum_{j=1}^{p} c_j \mu_j = 0$. (42)

With equal sample sizes of n per group, this hypothesis may be tested with a t statistic of the form

$t = \dfrac{\sqrt{n} \, \hat{\psi}}{\sqrt{MS_{within} \sum_{j=1}^{p} c_j^2}}$, (43)

with

$\hat{\psi} = \sum_{j=1}^{p} c_j \bar{Y}_j$, (44)

where $\bar{Y}_j$ represents the sample mean of the jth group. The standardized effect size $E_s$ is the size of the contrast in standard deviation units, that is,

$E_s = \psi / \sigma$. (45)

The test statistic has a noncentral t distribution with p(n - 1) degrees of freedom and a noncentrality parameter of

$\delta = \sqrt{\dfrac{n}{\sum_{j=1}^{p} c_j^2}} \; E_s = L E_s$. (46)

To estimate $E_s$, one obtains a confidence interval for $\delta$, using the method discussed by Steiger and Fouladi (1997), and transforms the endpoints of the confidence interval by dividing by L (i.e., the radical expression in Equation 46), as shown in the example below.

Example 8: The data in Table 2 represent four independent groups of three observations each. Suppose one wished to test the following null hypothesis:

$H_0: \dfrac{\mu_1 + \mu_4}{2} - \dfrac{\mu_2 + \mu_3}{2} = 0$. (47)

This hypothesis tests whether the average of the means of the first and fourth groups is equal to the average of the means of the other two groups. Suppose we observe t(8) = …. The traditional 95% confidence interval for $\psi$ ranges from … to …. Because mean square error is 1 in this example, we would expect a confidence interval for $E_s$ to be similar. Actually, it is somewhat narrower. The 95% confidence interval for $\delta$ ranges from … to …. The sum of squared contrast weights is 1, so $L = \sqrt{3}$, and the endpoints of the confidence interval are divided by $\sqrt{3}$ to obtain 95% confidence limits of … and .041 for $E_s$.

Table 2
Sample Data for a One-Way Analysis of Variance (columns: Group 1, Group 2, Group 3, Group 4)

Exact Confidence Intervals for $\rho_{contrast}$

Rosenthal et al. (2000) discussed the sample statistic $r^2_{contrast}$, which is the squared partial correlation between the contrast weight vector discussed in the previous section and the scores in y, with all other sources of systematic between-groups variation partialed out. Consider the data discussed in the preceding example. These weights happen to be the rescaled orthogonal polynomial weights for testing quadratic trend.
The remaining sources of between-groups variation may be predicted from any orthogonal complement of the quadratic trend contrast weights. Consequently, if we construct the vectors with columns of repeated linear and cubic contrast weights, the partial correlation between y and the quadratic contrast weights with the linear and cubic weights partialed out is $r_{contrast}$, which may also be computed directly from the standard F statistic for the contrast as

$r^2_{contrast} = \dfrac{F_{contrast}}{F_{contrast} + df_{within}}$. (48)

Rosenthal et al. (2000) did not discuss sampling theory for $r_{contrast}$. However, a population equivalent, $\rho_{contrast}$, may be defined, and it may be shown (see Appendix B) that, with p groups in the analysis,

$F_{contrast} = \dfrac{r^2_{contrast}}{(1 - r^2_{contrast}) / (N_{tot} - p)}$ (49)

has a noncentral F distribution with 1 and $N_{tot} - p$ degrees of freedom and noncentrality parameter

$\lambda = N_{tot} \, \dfrac{\rho^2_{contrast}}{1 - \rho^2_{contrast}}$. (50)

Consequently, one may construct a confidence interval for $\rho^2_{contrast}$ by computing a confidence interval for $\lambda$ and transforming the endpoints, using the result of Equation 37.

Example 9: Consider again the data in Table 2. We can compute the F statistics corresponding to linear, quadratic, and cubic trend and, for each trend, compute confidence intervals for $\rho^2_{contrast}$ and/or $\rho_{contrast}$. For example, consider the test for linear trend. The F statistic is 16, with 1 and 8 degrees of freedom, and the 95% confidence interval for the noncentrality parameter $\lambda$ has endpoints of … and …. Consequently, from Equation 37, a 95% confidence interval for $\rho^2_{contrast}$ has endpoints of

$\rho^2_{lower} = \dfrac{\lambda_{lower}}{\lambda_{lower} + 12}$, $\rho^2_{upper} = \dfrac{\lambda_{upper}}{\lambda_{upper} + 12}$. (51)

The confidence interval for $\rho_{contrast}$ (defined as the square

root of $\rho^2_{contrast}$, thus excluding negative values as in Rosenthal et al., 2000) ranges from .905 to .988. Table 3 shows the results of computing contrast correlations and the associated confidence intervals for linear, quadratic, and cubic trend. Some brief comments are in order. Note, first, that although the $r_{contrast}$ values for quadratic and cubic trends are appealingly high, the corresponding confidence intervals are quite wide and include zero. On the other hand, the confidence interval for the linear trend is very narrow.

The Relationship Between Confidence Intervals and Hypothesis Tests

Choosing the Appropriate Interval

Confidence intervals on measures of effect size convey all the information in a hypothesis test, and more. If one selects an appropriate confidence interval, a hypothesis test may be performed simply by inspection. If the confidence interval excludes the null hypothesized value, then the null hypothesis is rejected. In such applications, I recommend using the traditional two-sided confidence interval, rather than a one-sided interval (or confidence bound), regardless of whether the hypothesis test is one-sided or two-sided. When a two-sided confidence interval is used to perform the hypothesis test, the confidence level must be matched appropriately both to the type of hypothesis test and to the Type I error rate.

Recall that the endpoints of the two-sided confidence interval for a parameter $\theta$ at the $100(1 - \alpha)\%$ confidence level are the values of $\theta$ that place the observed statistic $\hat{\theta}$ at the $\alpha/2$ or $1 - \alpha/2$ cumulative probability point. Suppose the upper and lower limits of the $100(1 - \alpha)\%$ confidence interval are $\theta_U$ and $\theta_L$, respectively. Then $\hat{\theta}$ is the rejection point at the $\alpha/2$ significance level for one-sided hypothesis tests that $\theta$ is, first, greater than or equal to $\theta_U$ and, second, less than or equal to $\theta_L$.
The observed statistic $\hat{\theta}$ is also equal to (a) the upper rejection point for a two-sided test that $\theta = \theta_L$ at the alpha level and (b) the lower rejection point for the two-sided test that $\theta = \theta_U$ at the alpha level. Consequently, the endpoints of the confidence interval represent two values of $\theta$ that the observed statistic would barely reject in a two-sided test with significance level alpha. These endpoints are also appropriate for testing one-sided hypotheses at the $\alpha/2$ significance level.

Table 3
Confidence Intervals (CIs) for Contrast Correlations (columns: Linear, Quadratic, Cubic; rows: F, $r^2_{contrast}$, CI, $r_{contrast}$, CI)

The preceding paragraph implies a general rule of thumb for using confidence intervals to test a statistical hypothesis while maintaining a Type I error rate at alpha:

1. When testing a two-sided hypothesis at the alpha level, use a $100(1 - \alpha)\%$ confidence interval.

2. When testing a one-sided hypothesis at the alpha level, use a $100(1 - 2\alpha)\%$ confidence interval.

Example 10: Consider a test of the hypothesis that $\Psi = 0$, that is, that the RMSSE (as defined in Equation 1) in an ANOVA is zero. This hypothesis test is one-sided, because the RMSSE cannot be negative. To use a two-sided confidence interval to test this hypothesis at the $\alpha = .05$ significance level, one should examine the $100(1 - 2\alpha)\% = 90\%$ confidence interval for $\Psi$. If the confidence interval excludes zero, the null hypothesis will be rejected. This hypothesis test is equivalent to the standard ANOVA F test.

Example 11: Consider the test that the standardized effect size $E_s$ in Equation 45 is precisely zero. This hypothesis test is two-sided, because $E_s$ can be either positive or negative. Consequently, to use a confidence interval to test this hypothesis at the $\alpha = .05$ level, a $100(1 - \alpha)\% = 95\%$ two-sided confidence interval should be used, and the null hypothesis rejected only if both ends of the confidence interval are above zero or if both are below zero.
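The inversion principle stated above, in which each endpoint is the parameter value that places the observed statistic at the $\alpha/2$ or $1 - \alpha/2$ cumulative probability point, can be sketched for the noncentrality parameter of an F statistic. The code below is an illustrative sketch and not the article's software: it uses only the Python standard library, evaluates the noncentral F distribution as a Poisson-weighted sum of incomplete beta functions, and finds each endpoint by bisection. All function names are hypothetical. In practice a statistics library that provides the noncentral F distribution (for instance `scipy.stats.ncf`) would normally replace the hand-written distribution function.

```python
import math


def _betacf(a, b, x, max_iter=500, eps=1e-13):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        even = m * (b - m) * x / ((qam + m2) * (a + m2))
        odd = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        for coef in (even, odd):
            d = 1.0 + coef * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + coef / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def ncf_cdf(f_obs, df1, df2, lam, max_terms=5000):
    """P(F <= f_obs) for a noncentral F with noncentrality lam."""
    x = df1 * f_obs / (df1 * f_obs + df2)
    if lam <= 0.0:
        return betainc(df1 / 2.0, df2 / 2.0, x)
    half = lam / 2.0
    total = 0.0
    for j in range(max_terms):
        # Poisson(j; lam / 2) weight, computed on the log scale
        w = math.exp(-half + j * math.log(half) - math.lgamma(j + 1.0))
        total += w * betainc(df1 / 2.0 + j, df2 / 2.0, x)
        if j > half and w < 1e-15:
            break
    return total


def noncentrality_interval(f_obs, df1, df2, conf=0.90, tol=1e-10):
    """Two-sided interval for lam, truncated at zero, found by bisection."""
    alpha = 1.0 - conf

    def solve(target):
        # The cdf decreases as lam grows; if it is already at or below the
        # target when lam = 0, the endpoint is truncated to zero.
        if ncf_cdf(f_obs, df1, df2, 0.0) <= target:
            return 0.0
        lo, hi = 0.0, 1.0
        while ncf_cdf(f_obs, df1, df2, hi) > target:
            lo, hi = hi, hi * 2.0
        while hi - lo > tol * max(1.0, hi):
            mid = 0.5 * (lo + hi)
            if ncf_cdf(f_obs, df1, df2, mid) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return solve(1.0 - alpha / 2.0), solve(alpha / 2.0)


def rho_squared_limits(lam_lower, lam_upper, n_tot):
    """Convert limits on lam to limits on rho^2 (Equation 37)."""
    return lam_lower / (lam_lower + n_tot), lam_upper / (lam_upper + n_tot)


# F(2, 6) = 27 with N_tot = 9, the values of Example 7
lam_lo, lam_hi = noncentrality_interval(27.0, 2, 6, conf=0.90)
print(lam_lo, lam_hi, rho_squared_limits(lam_lo, lam_hi, 9))
```

Bisection was chosen here for transparency rather than speed; it relies only on the fact that the cumulative probability of the observed F decreases as the noncentrality parameter increases.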
Example 12: Consider a situation in which one wishes to establish that the standardized effect size $E_s$ in Equation 45 is small, and that smallness is defined as an absolute value less than 0.20. To establish smallness, one must reject a hypothesis that $E_s$ is not small. Because $E_s$ can be either positive or negative, $E_s$ can be not small in two directions. The hypothesis that $E_s$ is not small can therefore be tested with two simultaneous one-sided hypothesis tests,

$H_{01}: E_s \le -0.20$ versus $H_{a1}: E_s > -0.20$ (52)

and

$H_{02}: E_s \ge 0.20$ versus $H_{a2}: E_s < 0.20$. (53)

These two hypotheses can both be tested simultaneously at the .05 level by constructing a 90% confidence interval and observing whether the lower end of the interval is above -0.20 (to test the first one-sided hypothesis) and the upper end of the interval is below 0.20 (to test the second). What this amounts to is observing whether the entire interval is between -0.20 and 0.20. If so, the hypothesis that $E_s$ is not small is rejected, and smallness is indicated.
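The rule of thumb and the inspection logic of Examples 10 through 12 reduce to a few comparisons. The sketch below is illustrative only: the function names, the three-way verdict labels, and the interval endpoints in the usage lines are hypothetical and do not come from the article.

```python
def confidence_level(alpha, one_sided):
    """Confidence level of the two-sided interval to inspect for a test at level alpha."""
    return 1.0 - 2.0 * alpha if one_sided else 1.0 - alpha


def excludes(lower, upper, value):
    """True when the interval [lower, upper] does not contain the hypothesized value."""
    return value < lower or value > upper


def smallness_verdict(lower, upper, bound):
    """Read an interval on a signed effect size against a smallness bound.

    'small'        : the whole interval lies strictly between -bound and +bound
    'not small'    : the whole interval lies at or beyond one of the bounds
    'inconclusive' : the interval straddles a bound
    """
    if -bound < lower and upper < bound:
        return "small"
    if lower >= bound or upper <= -bound:
        return "not small"
    return "inconclusive"


# Hypothetical 90% interval endpoints for a standardized contrast effect
level = confidence_level(0.05, one_sided=True)
print(level, smallness_verdict(-0.05, 0.12, bound=0.20))
```

Because each of the two one-sided tests is run at the alpha level, the interval inspected is the $100(1 - 2\alpha)\%$ interval, which is what `confidence_level` returns when `one_sided` is true.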

Tests of Minimal Effect

Rationale and Method

In many situations, the null hypothesis of zero effect is inappropriate or can be misleading. For example, in R-S testing with extremely large sample sizes, a null hypothesis may be rejected consistently, with a very low probability level, even when the population effect is small. Conversely, in A-S testing, the nil hypothesis of zero effect is often unreasonable, and the hypothesis the experimenter probably wants to test is that the effect is trivial.

Tests of minimal effect are a partial solution to the problems caused by inappropriate testing of a nil hypothesis when the goal is to show that an effect is small. For example, if some minimal reasonable effect size can be specified, rejection of the hypothesis that the effect is less than or equal to this value is of practical importance whether or not the sample size is very large. In the traditional A-S situation, in which the experimenter is trying to show that an effect is trivial, the hypothesis that the effect is greater than or equal to a minimal reasonable value can be tested. Serlin and Lapsley (1993) discussed this latter notion in detail and gave numerical examples. In such cases, large sample size will work for, rather than against, the experimenter, because if the effect size is truly below a level that is of practical import, larger samples will yield greater power to demonstrate that fact by rejecting the null hypothesis that the effect is at or above a point of triviality.

The confidence intervals described in the preceding section can be used to test hypotheses of minimal effect: One simply observes whether the appropriately constructed confidence interval contains the target minimal reasonable value. For example, suppose you decide that an RMSSE of 0.25 constitutes a minimal reasonable effect. In other words, effects below that level may be ignored. Effects that are definitely above that level are nontrivial.
If you wish to demonstrate that effects are trivial, you might test the hypotheses

$H_0: \Psi \ge 0.25$; $H_1: \Psi < 0.25$. (54)

On the other hand, if you wish to demonstrate that effects are definitely not trivial, you might test the hypotheses

$H_0: \Psi \le 0.25$; $H_1: \Psi > 0.25$. (55)

In each case, rejecting the null hypothesis will support the goal in performing the test, and the problems inherent in A-S testing can be avoided. A simple approach to simultaneously testing the two hypotheses discussed above is to examine the $100(1 - 2\alpha)\%$ confidence interval for $\Psi$ and see if it excludes 0.25. If the entire confidence interval is above the point of triviality (i.e., 0.25), then the effect may be judged nontrivial. If the entire confidence interval is below the point of triviality, then the effect has been shown to be trivial. There is a strong similarity between using the effect size confidence interval in this way and the long tradition of bioequivalence testing.

Example 13: Suppose you have p = 6 groups and n = 75 per group. You observe an F statistic of F(5, 444) = 2.28, with p = .046, so the nil hypothesis of zero effects is rejected at the .05 significance level. However, on substantive grounds, you have decided that a value of $\Psi$ less than 0.25 can be ignored. To demonstrate triviality, you would attempt to reject the null hypothesis that $\Psi$ is greater than or equal to 0.25. There are two approaches to performing the test. The first approach requires only a single calculation from the noncentral F distribution. Consider the cutoff value of 0.25. Using the result of Equation 1, one may convert this to a value for $\lambda$ via the formula

$\lambda = (p - 1) n \Psi^2 = (6 - 1)(75)(.25^2) = 23.4375$.

The observed F statistic of 2.28 has a one-sided probability value of .0256 in the noncentral F distribution with $\lambda = 23.4375$, and 5 and 444 degrees of freedom, so the null hypothesis is rejected at the .05 level, and the overall effects are declared trivial. An alternative approach uses the confidence interval. Note that, because the test is one-sided, we use the 90% confidence interval.
The endpoints of the interval for $\lambda$ are … and …. Using the result of Equation 1, we convert this confidence interval into a confidence interval for $\Psi$ by dividing the above endpoints by (p - 1)n = 375, then taking the square root. The resulting endpoints for the confidence interval for $\Psi$ are … and …. This confidence interval excludes 0.25, so we can reject the hypothesis that effects are nontrivial, that is, $\Psi \ge 0.25$, at the .05 significance level. The advantage of using the confidence interval is that it provides us with an approximate indication of the precision of the estimation process while still allowing us to perform the hypothesis test.

Significant technical and theoretical issues surround the use of confidence intervals in this manner.

1. The choice of a numerical point of triviality for a measure of omnibus effect size should not be treated as a mechanical selection from a small menu of approved choices. Rather, it should be considered carefully on the basis of the specific experimental design and the substantive aspects of the variables being measured and manipulated. Whereas .25 might be considered trivial in one experiment, it might be considered very important in another.

2. The power of both hypothesis tests must be analyzed a priori to assess whether sample size is adequate. With low precision (i.e., a wide confidence interval), one might still have high power to demonstrate nontrivial effects if effects are large. However, it is virtually impossible to demonstrate triviality if precision is low, because the triviality point will be close to zero, and a wide confidence interval will not fit between zero and the triviality point.

Full consideration of the technical aspects of estimating the point of triviality, and precision of a parameter estimate and the resulting confidence interval, is beyond the scope (and length restrictions) of this article. However, in the next section, I discuss several theoretical issues that the sophisticated user should keep in mind.

Conclusions and Discussion

This article demonstrates that the F statistic in ANOVA contains information about standardized effect size, and its precision of estimation, that has not been made available in typical social science reports and is not reported by traditional software packages. Yet this information can readily be calculated, using a few basic techniques. The fact is, simply reporting an F statistic, and a probability level attached to a hypothesis of nil effect, is so suboptimal that its continuance can no longer be justified, at least in a social science tradition that prides itself on empiricism. A number of the field's most influential commentators on social statistics have emphasized this and urged that, as researchers, we revise our approach to reporting the results of significance tests (e.g., see articles in Harlow, Mulaik, & Steiger, 1997). Null hypothesis testing is the source of much controversy. I have tried to promote an eclectic, integrated point of view that resists the temptation to downgrade either the hypothesis testing or the interval estimation approaches and emphasizes how they complement each other.
Reviewers and other readers of the article have provided much food for thought and have raised several substantive criticisms that enriched my point of view considerably. In the following sections, I discuss some of the limitations of the procedures in this article, deal explicitly with several of the more common objections to my major suggestions, and then summarize my point of view and present some conclusions.

Statistical Limitations and Extensions of the Present Procedures

The procedures discussed in this article provide exact distributional results under standard ANOVA assumptions (independence, normality, and equal variances) and are easily calculated with modern software. However, they are restricted to (a) completely between-subjects fixed-effect ANOVA with (b) equal n per cell. The present article does not present procedures for dealing with the complications that result from unbalanced designs and/or repeated measures, nor does it discuss extensions to random effects or mixed ANOVA models or to multivariate analyses.

In some cases, procedures for these other situations are already available. Consider, for example, the case of one-way random effects ANOVA. The treatment effects are random variables with a variance of $\sigma^2_A$, and $\Psi$ may be redefined as $\sigma_A / \sigma$. A $100(1 - \alpha)\%$ confidence interval for $\Psi$ may therefore be obtained in the equal n case by taking the square root of the well-known (Glass & Hopkins, 1996, p. 54) confidence interval for $\sigma^2_A / \sigma^2$. One obtains, with p groups,

$\Psi_{lower} = \sqrt{\max\left\{ n^{-1}\left( F_{obs} / F^*_{\alpha/2} - 1 \right), 0 \right\}}$, $\Psi_{upper} = \sqrt{\max\left\{ n^{-1}\left( F_{obs} / F^*_{1-\alpha/2} - 1 \right), 0 \right\}}$. (56)

$F_{obs}$ is the observed value of the F statistic, and $F^*$ is the percentage point from the F distribution with p - 1 and p(n - 1) degrees of freedom. This approach can be generalized to more complicated designs. Burdick and Graybill (1992) discussed general methods for obtaining exact confidence intervals for $\Psi$ and related quantities in random effects models, both in the equal n and unbalanced cases.
Computational procedures for the unbalanced case are much more complicated than for the case of equal n. However, on close inspection, some extensions yield challenging complications that require careful analysis. Some examples are as follows.

1. In the unbalanced, fixed-effects case, the noncentrality parameter is defined as follows:

$\lambda = \sum_{j=1}^{p} n_j \left( \dfrac{\alpha_j}{\sigma} \right)^2$. (57)

Note that with $\lambda$ defined as in Equation 9, the quantity $f^2$ as defined in Equation 10 represents the ratio of between-groups to within-group variance in a population with probability of membership in the treatment groups proportional to the sample sizes in the ANOVA. There are situations in which this quantity is of interest (such as when the sampling plan reflects the relative size of natural subpopulations) and others in which it might not be. Cohen (1988) discussed this point in detail.

2. In repeated measures ANOVA designs, the noncentrality parameter unfortunately confounds effects of treatments with the correlation among observations. For example, in a one-way within-subjects design, if the data possess compound symmetry, the noncentrality parameter is

$\lambda = \dfrac{n}{1 - \rho} \sum_{j=1}^{p} \left( \dfrac{\alpha_j}{\sigma} \right)^2$. (58)

The RMSSE, $\Psi$, as defined in Equation 1, though still an appropriate measure of effect size, cannot be estimated directly using the exact techniques discussed in this article, unless $\rho$ is known. For a detailed discussion of this issue in the context of point estimation in meta-analysis, see Dunlap, Cortina, Vaslow, and Burke (1996).

3. In multivariate analysis, the noncentrality parameter includes information about the variances and correlations of the dependent variables. For example, when two populations are compared on k dependent variables, using Hotelling's $T^2$ with two independent samples of size $n_1$ and $n_2$, the standard F statistic has k and $n_1 + n_2 - k - 1$ degrees of freedom and has a noncentral F distribution with a noncentrality parameter that is a simple function of the squared population Mahalanobis distance $\Delta^2$:

$\lambda = \dfrac{n_1 n_2}{n_1 + n_2} \, \Delta^2$. (59)

The latter, computed as

$\Delta^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, (60)

with $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ the population mean vectors, and $\boldsymbol{\Sigma}$ the common covariance matrix, may be described as a sum of squared orthogonalized and standardized mean differences. Consequently, a natural analogue of Equation 1 that takes into account the number of dependent variables is

$\Psi = \sqrt{\Delta^2 / k}$. (61)

A confidence interval on $\Delta^2$ may be calculated easily (Reiser, 2001) from a confidence interval on $\lambda$, using the results of Equation 59. This interval may, in turn, be transformed into a confidence interval on $\Psi$ using Equation 61.

We see that in one of the contexts discussed above (repeated measures), dependencies between measures are an annoying confound that must be removed from consideration. In another (the case of two populations), they are an essential ingredient for proper evaluation of effect size.
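For the two-population multivariate case, the chain from the Mahalanobis distance to the noncentrality parameter and back to a per-variable effect size can be sketched as follows. This is a hedged illustration: it assumes the Mahalanobis form of Equations 59 and 60 and reads Equation 61 as $\Psi = \sqrt{\Delta^2 / k}$, it handles only k = 2 so that the matrix inverse can be written out by hand, and the function names, means, covariances, and sample sizes are all hypothetical.

```python
import math


def mahalanobis_sq_2d(mu1, mu2, cov):
    """Squared Mahalanobis distance for k = 2, with cov given as ((a, b), (c, d))."""
    d0, d1 = mu1[0] - mu2[0], mu1[1] - mu2[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Quadratic form diff' * inverse(cov) * diff, using the explicit 2 x 2 inverse
    return (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det


def noncentrality_from_distance(delta_sq, n1, n2):
    """Noncentrality for the two-sample Hotelling test (Equation 59)."""
    return n1 * n2 / (n1 + n2) * delta_sq


def effect_size_from_noncentrality(lam, n1, n2, k):
    """Invert Equation 59, then apply the assumed form of Equation 61."""
    delta_sq = lam * (n1 + n2) / (n1 * n2)
    return math.sqrt(delta_sq / k)


# Hypothetical populations: two variables, correlation .5, unit variances
delta_sq = mahalanobis_sq_2d((0.5, 0.5), (0.0, 0.0), ((1.0, 0.5), (0.5, 1.0)))
lam = noncentrality_from_distance(delta_sq, 20, 20)
print(delta_sq, lam, effect_size_from_noncentrality(lam, 20, 20, k=2))
```

Applying `effect_size_from_noncentrality` to each endpoint of a confidence interval on $\lambda$ gives the corresponding interval on $\Psi$, since the transformation is monotonic.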
In some of the problematic cases discussed above, and in situations where the standard ANOVA statistical assumptions are inappropriate, resampling methods such as bootstrapping can be used to obtain appropriate confidence intervals.

The width of a confidence interval often is described as indicating precision of measurement. However, as Steiger and Fouladi (1997) pointed out, this relationship is less than perfect and is seriously compromised in some situations for several reasons. The width of a confidence interval is itself a random variable and is subject to sampling variations. Moreover, the confidence intervals are truncated at zero to avoid improper estimates. In extreme cases, a confidence interval might actually have 0 as both endpoints. This zero-width confidence interval obviously does not imply that effect size was determined with perfect precision.

Focused Contrasts or Omnibus Hypotheses?

In an early version of this article, I concentrated almost exclusively on omnibus measures of effect size. Several reviewers have objected that confidence intervals on measures of standardized effect size such as $\eta^2$ and the RMSSE were, to paraphrase, an elegant solution to the wrong problem. These writers have echoed the view of Rosenthal et al. (2000), who stated that omnibus questions seldom address questions of real interest to researchers, and are typically less powerful than focused procedures (p. 1). I share an enthusiasm for focused contrasts and recommend them in lieu of an omnibus test whenever researchers have clear ideas about linear contrasts. Moreover, I believe that not enough researchers have been trained to look carefully for ways to phrase their ideas as contrasts. However, I think that dismissing the improvements to the use of the F statistic suggested in this article ignores several important realities. First, much research in the social sciences is exploratory, and an omnibus F test in such circumstances may be the prelude to subsequent examination of unplanned contrasts.
In such cases, an overall measure of the strength of effect sizes, and the precision with which they have been determined, may alert the researcher in advance to a lack of overall precision in the experimental design. Second, when one is comparing several studies that have reported overall F tests, comparing confidence intervals on standardized effect size measures can be very useful in resolving apparent disparities in experimental outcomes.

As it turns out, the confidence interval on $\rho^2_{contrast}$ is closely related, conceptually and computationally, to the procedure for computing a confidence interval on $\rho^2$, one standardized measure of omnibus effect size. The latter index examines the squared multiple correlation between observed data and a set of contrast weights, whereas the former examines the squared correlation between the data and one set of contrast weights with the variation predicted by the complementary contrasts partialed out. Thus, the

same technology that I find useful for omnibus tests may be applied directly to contrasts. I believe that reporting an exact confidence interval on $\rho_{contrast}$ is substantially more informative than simply reporting the raw coefficient. And, to be clear, I fully support concentration on focused contrasts in lieu of omnibus tests whenever the experimenter has firm questions that suit the contrast analysis framework.

Some Recent Objections to Standardized Measures of Effect Size

Revised hypothesis-testing strategies for ANOVA require specification of target values of a standardized measure of effect size. The confidence interval approach is more relaxed but strongly tends to lead the experimenter to consider which overall effect sizes qualify as trivial and which are nontrivial in a particular application. Although many writers have emphasized the value of standardized measures of effect size in power analysis and sample size estimation, standardized effect size measures do have some shortcomings. As a nonlinear combination of several sources of variation in an experiment, they reduce several values into one and are of necessity less precise than similar indices computed on a focused contrast. Moreover, ANOVA effects as used in the calculation of the noncentrality parameter in the omnibus test may or may not correspond to experimental effects as commonly conceptualized (see, e.g., Steiger & Fouladi, 1997), and focused contrasts can get at such experimental effects much more effectively than an omnibus procedure.

Recently, Lenth (2001) suggested dispensing with standardized measures of effect size altogether in the context of power analysis and sample size estimation. His main justification was that combining information about raw effects (i.e., mean differences) and variation ignored a possible confounding impact of reliability of measurement.
Reconciling the Interval Estimation and Minimal-Effect-Testing Approaches

As stated at the outset, this article discusses two major approaches that might be used to replace the traditional F test in ANOVA. The noncentrality interval estimation approach emphasizes estimation of some function of overall effect size, along with an indication of the precision of the measurement. The dual hypothesis testing approach replaces the hypothesis of nil effect with two hypotheses, one that the effect is trivial, the other that it is nontrivial.

The approach I personally favor is confidence interval estimation on some standardized measure of overall effect size. This approach may be viewed as replacing hypothesis testing entirely, yet it can be used to perform both kinds of hypothesis tests required by the dual hypothesis-testing framework. Specifically, one simply examines, simultaneously, whether the confidence interval excludes a trivial effect value on the left or right. If, for example, the confidence interval lies entirely above the cutoff point for a trivial effect, one rejects the hypothesis of triviality. If the confidence interval lies entirely below the cutoff point, one rejects the hypothesis of nontriviality. Moreover, the confidence interval approach, being an exact procedure, also provides all the information available in the standard F test. For example, the F test results in rejection at the .05 level if and only if the 90% confidence interval for $\lambda$ excludes zero.

The hypothesis-testing approach offers advantages as well. For one, it keeps the analysis within the comfortably familiar bounds of hypothesis testing. For another, it is computationally easier: one may perform the hypothesis test without extensive iteration, and so it may be performed with a wider range of available free software.
Another advantage is that, by simultaneously analyzing power for both a test of triviality and a test of nontriviality, the user can be relatively certain that the confidence interval, if calculated, will have enough precision to determine whether effects are trivial or not.

Standardized Effects and Coefficients of Determination

A Caution

Any statistical technique offers opportunity for abuse and misuse, especially if the technique is used mechanically and without taking into account the special circumstances surrounding a particular set of data. Abelson (1995) discussed in detail how important it is to remain open-minded when judging the importance of effect sizes. In some cases, effects that seem small may be quite important. This should be kept in mind before effects that are nonzero, but seemingly trivial, are dismissed. Abelson's comments are similar to Cohen's (1988) in his chapter on special issues in power analysis.

Casting a Vote for Change

A fundamental contribution to behavioral statistics by Cohen (1962) was to demonstrate that many studies lack sufficient statistical power. The initial emphasis on power analysis spearheaded by Cohen (1962) has now given way to a more sophisticated emphasis on precision of estimation. Confidence intervals on standardized measures of effect size allow one to assess how precisely effects have been measured and simultaneously assess whether the experiment has ruled out (a) the notion that effects are trivial and (b) the notion that they are nontrivial. The procedures are straightforward and offer obvious benefits. It is time for a change. Yet there are numerous obstacles to change in
behavioral statistics practice. A significant obstacle is the dominant influence a few commercial statistical packages such as SPSS and SAS have on practice in the field. The way psychology has operated in the past, procedures are unlikely to be used until they have been implemented in a widely used statistics package, and commercial statistics packages tend to be conservative toward new approaches. In the final analysis, the impetus for change may have to come from journal editors and practitioners, some of whom have resisted change for a variety of reasons discussed by Thompson (1999).

Fortunately, the Internet makes it possible to distribute innovative software to practitioners very easily at virtually zero cost. There is no longer any reason to report a squared multiple correlation, an ANOVA F statistic, or a focused contrast t test without providing information about confidence intervals on standardized effects. Each reader of this article can cast votes for change by obtaining the freeware I (and other authors) have made available, and then, when reviewing articles that report omnibus tests and focused contrasts without associated intervals, taking two simple steps: (a) performing their own calculation of confidence intervals on standardized effect size and (b) requesting that the author include this information in the published article.

References

Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Erlbaum.
Burdick, R. K., & Graybill, F. A. (1992). Confidence intervals on variance components. New York: Dekker.
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Pacific Grove, CA: Duxbury.
Chow, S.-C., & Liu, J.-P. (2000). Design and analysis of bioavailability and bioequivalence studies. New York: Dekker.
Cohen, J. (1962). The statistical power of abnormal social psychological research. Journal of Abnormal and Social Psychology, 65.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Mahwah, NJ: Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49.
Dunlap, W. P., Cortina, J. M., Vaslow, J. M., & Burke, M. J. (1996). Meta-analysis of experiments with matched groups or repeated measures designs. Psychological Methods, 1.
Fleishman, A. E. (1980). Confidence intervals for correlation ratios. Educational and Psychological Measurement, 40.
Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in education and psychology (3rd ed.). Needham Heights, MA: Allyn & Bacon.
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (1997). What if there were no significance tests? Mahwah, NJ: Erlbaum.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. New York: Academic Press.
Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. American Statistician, 55.
MacCallum, R. C., Browne, M. W., & Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychological Methods, 1.
Mels, G. (1989). A general system for path analysis with latent variables. Unpublished master's thesis, University of South Africa, Pretoria, South Africa.
Metzler, C. M. (1974). Bioavailability: A problem in equivalence. Biometrics, 30.
Ozer, D. J. (1985). Correlation and the coefficient of determination. Psychological Bulletin, 97.
Reiser, B. (2001). Confidence intervals for the Mahalanobis distance. Communications in Statistics, Simulation and Computation, 30.
Rosenthal, R. (1991). Effect sizes: Pearson's correlation, its display via the BESD, and alternative indices. American Psychologist, 46.
Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. New York: Cambridge University Press.
Rosnow, R. L., & Rosenthal, R. (1996). Computing contrasts, effect sizes, and counternulls on other people's published data: General procedures for research consumers. Psychological Methods, 1.
Sampson, A. R. (1974). A tale of two regressions. Journal of the American Statistical Association, 69.
Schmidt, F. L. (1996). Statistical significance testing and cumulative research in psychology: Implications for the training of researchers. Psychological Methods, 1.
Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? Mahwah, NJ: Erlbaum.
Searle, S. R. (1987). Linear models for unbalanced data. New York: Wiley.
Serlin, R. A., & Lapsley, D. K. (1993). Rational appraisal of psychological research and the good-enough principle. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues. Hillsdale, NJ: Erlbaum.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educational and Psychological Measurement, 61.
Steiger, J. H. (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: Systat.
Steiger, J. H. (1990, October). Noncentrality interval estimation and the evaluation of statistical models. Paper presented at the meeting of the Society of Multivariate Experimental Psychology, Kingston, RI.

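Steiger, J. H. (1999). STATISTICA power analysis. Tulsa, OK: StatSoft.
Steiger, J. H., & Fouladi, R. T. (1992). R2: A computer program for interval estimation, power calculation, and hypothesis testing for the squared multiple correlation. Behavior Research Methods, Instruments, and Computers, 24.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical models. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Erlbaum.
Steiger, J. H., & Lind, J. C. (1980, May). Statistically based tests for the number of factors. Paper presented at the meeting of the Psychometric Society, Iowa City, IA.
Steiger, J. H., & Ward, L. M. (1987). Factor analysis and the coefficient of determination. Psychological Bulletin, 99.
Taylor, D. J., & Muller, K. E. (1995). Computing confidence bounds for power and sample size of the general linear univariate model. The American Statistician, 49.
Taylor, D. J., & Muller, K. E. (1996). Bias in linear model power and sample size calculation due to estimating noncentrality. Communications in Statistics: Theory and Methods, 25.
Thompson, B. (1999). Why encouraging effect size reporting is not working: The etiology of researcher resistance to changing practices. The Journal of Psychology, 133.
Venables, W. (1975). Calculation of confidence intervals for noncentrality parameters. Journal of the Royal Statistical Society, Series B, 37.
Westlake, W. J. (1976). Symmetrical confidence intervals for bioequivalence trials. Biometrics, 32.

Appendix A

The Relationship Between $\eta^2$ and $\lambda$ in One-Way ANOVA

Define, for the sample means,

$$s_{\bar{x}}^2 = \frac{1}{p-1}\sum_{j=1}^{p}\left(\bar{x}_{j} - \bar{x}_{\cdot}\right)^2. \tag{A1}$$

The corresponding population quantity is

$$s_{\mu}^2 = \frac{1}{p-1}\sum_{j=1}^{p}\left(\mu_j - \mu\right)^2 = \frac{\beta' X' Q_1 X \beta}{n(p-1)}, \tag{A2}$$

where $\beta$, $X$, and $Q_1$ are as described in Equations 29 through 38. In a balanced, one-way ANOVA, with $p$ groups and $n$ observations per group, $SS_{\text{treatments}} = n(p-1)s_{\bar{x}}^2$.

Consider any estimator $\hat{\theta}$ of a parameter $\theta$. The probability limit of $\hat{\theta}$, denoted $\operatorname{plim}(\hat{\theta})$, is equal to a value $c$ if and only if for any error tolerance $\epsilon > 0$, we have

$$\lim_{n \to \infty} \Pr\left(|\hat{\theta} - c| < \epsilon\right) = 1.$$

The notion of a probability limit is closely related to that of consistency, in that $\hat{\theta}$ is a consistent estimator for $\theta$ if and only if $\operatorname{plim}_{n \to \infty}(\hat{\theta}) = \theta$. In what follows, for brevity of notation, I simply write $\operatorname{plim}(X)$ rather than $\operatorname{plim}_{n \to \infty}(X)$. I use a number of well-known results. In particular, if $\operatorname{plim}(X)$ and $\operatorname{plim}(Y)$ exist, then

$$\operatorname{plim}(X + Y) = \operatorname{plim}(X) + \operatorname{plim}(Y), \quad \operatorname{plim}(X/Y) = \operatorname{plim}(X)/\operatorname{plim}(Y), \quad \operatorname{plim}(XY) = \operatorname{plim}(X)\operatorname{plim}(Y).$$

Moreover, the plim of a sample moment is equal to the corresponding population moment. We define $\eta^2$ as $\operatorname{plim}(R^2)$, that is, the value that $R^2$ converges to in an infinite population. Then

$$\begin{aligned}
\operatorname{plim}(R^2) &= \operatorname{plim}\left(\frac{SS_{\text{treatments}}}{SS_{\text{treatments}} + SS_{\text{error}}}\right) \\
&= \frac{\operatorname{plim}\left[n(p-1)s_{\bar{x}}^2\right]}{\operatorname{plim}\left[n(p-1)s_{\bar{x}}^2\right] + \operatorname{plim}\left[p(n-1)MS_{\text{error}}\right]} \\
&= \frac{\operatorname{plim}\left(s_{\bar{x}}^2\right)}{\operatorname{plim}\left(s_{\bar{x}}^2\right) + \lim_{n \to \infty}\left[\dfrac{p(n-1)}{n(p-1)}\right]\operatorname{plim}\left(MS_{\text{error}}\right)} \\
&= \frac{s_{\mu}^2}{s_{\mu}^2 + \dfrac{p}{p-1}\sigma^2} = \frac{(p-1)s_{\mu}^2}{(p-1)s_{\mu}^2 + p\sigma^2}.
\end{aligned} \tag{A3}$$

Combining Equations A1 through A3, we obtain

$$\eta^2 = \frac{\beta' X' Q_1 X \beta / n}{\beta' X' Q_1 X \beta / n + p\sigma^2} \tag{A4}$$

and

$$\eta^2 = \frac{\beta' X' Q_1 X \beta}{\beta' X' Q_1 X \beta + N_{tot}\sigma^2} = \frac{\lambda}{\lambda + N_{tot}}, \tag{A5}$$

where $\lambda$ is as defined in Equation 34.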
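As a numerical illustration of this appendix, the sketch below checks two things for a balanced one-way design: that the population value $(p-1)s_{\mu}^2 / [(p-1)s_{\mu}^2 + p\sigma^2]$ agrees with $\lambda/(\lambda + N_{tot})$, and that the sample $R^2$ approaches that value as $n$ grows. It rests on assumptions that are not taken from the article: the noncentrality parameter is assumed to equal $n\sum_j(\mu_j - \mu)^2/\sigma^2$, the group means and $\sigma$ are made-up values, and the function names are hypothetical.

```python
import random

def eta_squared(mu, sigma):
    """Population eta-squared for a balanced one-way design with group means mu."""
    p = len(mu)
    grand = sum(mu) / p
    ss_mu = sum((m - grand) ** 2 for m in mu)  # equals (p - 1) * s_mu^2
    return ss_mu / (ss_mu + p * sigma ** 2)

def noncentrality(mu, sigma, n):
    """Assumed form of lambda: n * sum((mu_j - mu)^2) / sigma^2."""
    grand = sum(mu) / len(mu)
    return n * sum((m - grand) ** 2 for m in mu) / sigma ** 2

def sample_r_squared(mu, sigma, n, rng):
    """Simulate n normal observations per group and return SS_treat / SS_total."""
    groups = [[rng.gauss(m, sigma) for _ in range(n)] for m in mu]
    means = [sum(g) / n for g in groups]
    grand = sum(means) / len(means)
    ss_treat = n * sum((xb - grand) ** 2 for xb in means)
    ss_error = sum((y - xb) ** 2 for g, xb in zip(groups, means) for y in g)
    return ss_treat / (ss_treat + ss_error)

# Hypothetical population: three groups, means 0, 0.5, 1, common sigma 1.
mu, sigma = (0.0, 0.5, 1.0), 1.0
lam = noncentrality(mu, sigma, n=10)          # 10 observations per group
print(eta_squared(mu, sigma), lam / (lam + 3 * 10))
print(sample_r_squared(mu, sigma, 50000, random.Random(1)))
```

The first line of output shows the two population expressions side by side; the second is a single simulated $R^2$ from a very large sample, which should lie close to them.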

Appendix B

The Distribution of the F Statistic for r contrast

Assume the general linear model as described in Equations 29 and 30. For any full column rank matrix $A$, define $P_A = A(A'A)^{-1}A'$, and $Q_A = I - P_A$, with $I$ a conformable identity matrix. Define $\mathbf{1}$ to be a column of 1s. Partition $X$ as $X = [\mathbf{1} \;\; x_1 \;\; X_2]$. $x_1$ contains replications of the contrast weights for the contrast being evaluated, so that the $i$th value in $x_1$ is the contrast weight for the group that $y_i$ is in, and $X_2$ contains a set of columns that are the orthogonal complement of the contrast weights in $x_1$. Thus, for example, if $x_1$ contains contrast weights for evaluating linear trend, $X_2$ would contain quadratic and cubic contrast weights (or some full rank transformation of them). The regression weight vector is partitioned accordingly as

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}. \tag{B1}$$

Define $\mu$ as a vector of the population means of the $p$ groups, and $c$ as the linear weights for the contrast of interest. In this case, the contrast of interest is

$$\psi = c'\mu, \tag{B2}$$

and because $x_1$ contains $n$ replications of the elements of $c$, and $E(y)$ contains $n$ replications of the elements of $\mu$, we have

$$\psi = x_1'E(y)/n = \beta_1 \frac{1}{n} x_1'x_1 \tag{B3}$$

and

$$c'c = x_1'x_1/n. \tag{B4}$$

I first demonstrate that an F statistic may be constructed for $r_{\text{contrast}}$. Rosenthal et al. (2000) defined $r^2_{\text{contrast}}$ as the squared partial correlation between $y$ and $x_1$ with $X_2$ partialed out. This sample statistic can be computed as the following ratio of quadratic forms in $y$:

$$r^2_{\text{contrast}} = \frac{y'P_{x_1}y}{y'(I - P_{X_2} - P_{\mathbf{1}})y}. \tag{B5}$$

Consider the statistic

$$F_{\text{contrast}} = \frac{r^2_{\text{contrast}}}{(1 - r^2_{\text{contrast}})/(N_{tot} - p)} = \frac{y'P_{x_1}y}{y'(I - P_{x_1} - P_{X_2} - P_{\mathbf{1}})y/(N_{tot} - p)}. \tag{B6}$$

This statistic is a ratio of two quadratic forms, in the general form

$$\frac{y'Ay/a}{y'By/b}, \tag{B7}$$

where $a = 1$, and $b = N_{tot} - p$. From Searle (1987), $F_{\text{contrast}}$ has a noncentral F distribution with $a$ and $b$ degrees of freedom and noncentrality parameter

$$\lambda_{\text{contrast}} = \beta'X'AX\beta/\sigma^2 \tag{B8}$$

if $A$ is idempotent, $B$ is idempotent, $AB = 0$, and $a$ and $b$ are the ranks of $A$ and $B$, respectively. These four properties are easily established by substitution and the fact that $x_1$, $X_2$, and $\mathbf{1}$ are pairwise orthogonal. The orthogonality implies that the noncentral F distribution has a noncentrality parameter equal to

$$\lambda_{\text{contrast}} = \frac{\beta'X'P_{x_1}X\beta}{\sigma^2} = \frac{\beta_1^2 x_1'x_1}{\sigma^2}, \tag{B9}$$

and the ranks of $A$ and $B$ are 1 and $N_{tot} - p$, respectively.

Next, we derive the relationship between $\lambda_{\text{contrast}}$ and $\rho^2_{\text{contrast}}$, the population equivalent of $r^2_{\text{contrast}}$. We may write $r^2_{\text{contrast}}$ as follows:

$$r^2_{\text{contrast}} = \frac{F_{\text{contrast}}}{F_{\text{contrast}} + p(n-1)} = \frac{n\hat{\psi}^2/(MS_{\text{error}}\,c'c)}{n\hat{\psi}^2/(MS_{\text{error}}\,c'c) + p(n-1)} = \frac{\hat{\psi}^2}{\hat{\psi}^2 + c'c\,MS_{\text{error}}\,p(n-1)/n}. \tag{B10}$$

We define $\rho^2_{\text{contrast}}$ as

$$\operatorname{plim}\left(r^2_{\text{contrast}}\right) = \frac{\operatorname{plim}(\hat{\psi}^2)}{\operatorname{plim}(\hat{\psi}^2) + c'c \lim_{n \to \infty}\left[\dfrac{p(n-1)}{n}\right]\operatorname{plim}(MS_{\text{error}})} = \frac{\psi^2}{\psi^2 + p\sigma^2 c'c}. \tag{B11}$$

Hence,

$$\frac{\rho^2_{\text{contrast}}}{1 - \rho^2_{\text{contrast}}} = \frac{\psi^2}{p\sigma^2 c'c}, \tag{B12}$$

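which, after substitution of Equations B3 and B4, becomes

$$\frac{\rho^2_{\text{contrast}}}{1 - \rho^2_{\text{contrast}}} = \frac{\beta_1^2\,(x_1'x_1/n)^2}{p\sigma^2\,(x_1'x_1/n)} = \frac{\beta_1^2 x_1'x_1}{\sigma^2}\,\frac{1}{np} = \frac{\beta_1^2 x_1'x_1}{\sigma^2}\,\frac{1}{N_{tot}}. \tag{B13}$$

Recalling the result of Equation B9 for $\lambda_{\text{contrast}}$, we have thus shown that

$$\lambda_{\text{contrast}} = \frac{\beta_1^2 x_1'x_1}{\sigma^2} = N_{tot}\,\frac{\rho^2_{\text{contrast}}}{1 - \rho^2_{\text{contrast}}}. \tag{B14}$$

Received February 3, 2001
Revision received December 1, 2003
Accepted January 8, 2004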