QUANTITATIVE METHODS IN PSYCHOLOGY A Power Primer

Jacob Cohen
New York University

One possible reason for the continued neglect of statistical power analysis in research in the behavioral sciences is the inaccessibility of or difficulty with the standard material. A convenient, although not comprehensive, presentation of required sample sizes is provided here. Effect-size indexes and conventional values for these are given for operationally defined small, medium, and large effects. The sample sizes necessary for .80 power to detect effects at these levels are tabled for eight standard statistical tests: (a) the difference between independent means, (b) the significance of a product-moment correlation, (c) the difference between independent rs, (d) the sign test, (e) the difference between independent proportions, (f) chi-square tests for goodness of fit and contingency tables, (g) one-way analysis of variance, and (h) the significance of a multiple or multiple partial correlation.

The preface to the first edition of my power handbook (Cohen, 1969) begins:

During my first dozen years of teaching and consulting on applied statistics with behavioral scientists, I became increasingly impressed with the importance of statistical power analysis, an importance which was increased an order of magnitude by its neglect in our textbooks and curricula. The case for its importance is easily made: What behavioral scientist would view with equanimity the question of the probability that his investigation would lead to statistically significant results, i.e., its power? (p. vii)

This neglect was obvious through casual observation and had been confirmed by a power review of the 1960 volume of the Journal of Abnormal and Social Psychology, which found the mean power to detect medium effect sizes to be .48 (Cohen, 1962). Thus, the chance of obtaining a significant result was about that of tossing a head with a fair coin. I attributed this disregard of power to the inaccessibility of a meager and mathematically difficult literature, beginning with its origin in the work of Neyman and Pearson (1928, 1933).
The power handbook was supposed to solve the problem. It required no more background than an introductory psychological statistics course that included significance testing. The exposition was verbal-intuitive and carried largely by many worked examples drawn from across the spectrum of behavioral science.

In the ensuing two decades, the book has been through revised (1977) and second (1988) editions and has inspired dozens of power and effect-size surveys in many areas of the social and life sciences (Cohen, 1988, pp. xi-xii). During this period, there has been a spate of articles on power analysis in the social science literature, a baker's dozen of computer programs (reviewed in Goldstein, 1989), and a breakthrough into popular statistics textbooks (Cohen, 1988, pp. xii-xiii).

Sedlmeier and Gigerenzer (1989) reported a power review of the 1984 volume of the Journal of Abnormal Psychology (some 24 years after mine) under the title, "Do Studies of Statistical Power Have an Effect on the Power of Studies?" The answer was no. Neither their study nor the dozen other power reviews they cite (excepting those fields in which large sample sizes are used, e.g., sociology, market research) showed any material improvement in power. Thus, a quarter century has brought no increase in the probability of obtaining a significant result.

Why is this? There is no controversy among methodologists about the importance of power analysis, and there are ample accessible resources for estimating sample sizes in research planning using power analysis. My 2-decades-long expectation that methods sections in research articles in psychological journals would invariably include power analyses has not been realized. Indeed, they almost invariably do not.

I am grateful to Patricia Cohen for her useful comments. Correspondence concerning this article should be addressed to Jacob Cohen, Department of Psychology, New York University, 6 Washington Place, 5th Floor, New York, New York.
Psychological Bulletin, 1992, No. 1. Copyright 1992 by the American Psychological Association, Inc.

Of the 54 articles Sedlmeier and Gigerenzer (1989) reviewed, only 2 mentioned power, and none estimated power or necessary sample size or the population effect size they posited. In 7 of the studies, null hypotheses served as research hypotheses that were confirmed when the results were nonsignificant. Assuming a medium effect size, the median power for these tests was .25! Thus, these authors concluded that their research hypotheses of no effect were supported when they had only a .25 chance of rejecting these null hypotheses in the presence of substantial population effects.

It is not at all clear why researchers continue to ignore power analysis. The passive acceptance of this state of affairs by editors and reviewers is even more of a mystery. At least part of the reason may be the low level of consciousness about effect size: It is as if the only concern about magnitude in much psychological research is with regard to the statistical test result and its accompanying p value, not with regard to the psychological phenomenon under study. Sedlmeier and Gigerenzer (1989) attribute this to the accident of the historical precedence of Fisherian theory, its hybridization with the contradictory Neyman-Pearson theory, and the apparent completeness of Fisherian null hypothesis testing: objective, mechanical, and a clearcut go-no-go decision straddled over p = .05. I have suggested that the neglect of power analysis simply exemplifies the slow movement of methodological advance (Cohen, 1988, p. xiv), noting that it took some 40 years from Student's publication of the t test to its inclusion in psychological statistics textbooks (Cohen, 1990, p. 1311).

An associate editor of this journal suggests another reason: Researchers find too complicated, or do not have at hand, either my book or other reference material for power analysis. He suggests that a short rule-of-thumb treatment of necessary sample size might make a difference. Hence this article.

In this bare bones treatment, I cover only the simplest cases, the most common designs and tests, and only three levels of effect size. For readers who find this inadequate, I unhesitatingly recommend Statistical Power Analysis for the Behavioral Sciences (Cohen, 1988; hereafter SPABS). It covers special cases, one-sided tests, unequal sample sizes, other null hypotheses, set correlation and multivariate methods and gives substantive examples of small, medium, and large effect sizes for the various tests. It offers well over 100 worked illustrative examples and is as user friendly as I know how to make it, the technical material being relegated to an appendix.

Method

Statistical power analysis exploits the relationships among the four variables involved in statistical inference: sample size (N), significance criterion (α), population effect size (ES), and statistical power. For any statistical model, these relationships are such that each is a function of the other three. For example, in power reviews, for any given statistical test, we can determine power for given α, N, and ES. For research planning, however, it is most useful to determine the N necessary to have a specified power for given α and ES; this article addresses this use.
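As a rough illustration of how each of the four quantities follows from the other three, the sketch below uses a normal-curve approximation for a two-sided comparison of two independent group means with N cases per group. The function names are hypothetical, and the standardized mean difference d (defined in the Effect Size section) is used as the ES. The approximation is an assumption of this sketch, not the method used to build the tables in this article, so its results can differ slightly from the tabled values.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group test of means.

    Normal-curve approximation (an assumption of this sketch): the test
    statistic is treated as normal with mean d * sqrt(n / 2) and sd 1.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(d) * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def approx_n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate N per group needed to reach the requested power.

    Ignores the negligible probability of rejecting in the wrong tail.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / abs(d)) ** 2)


if __name__ == "__main__":
    # Power for given alpha, N, and ES (the "power review" direction).
    print(round(approx_power(d=0.5, n_per_group=64), 3))
    # N for given alpha, ES, and power (the "research planning" direction).
    print(approx_n_per_group(d=0.5))
```

Holding α and power fixed, halving d roughly quadruples the N returned by `approx_n_per_group`, which is the dependence on ES that the planning use relies on.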
The Significance Criterion, α

The risk of mistakenly rejecting the null hypothesis (H0) and thus of committing a Type I error, α, represents a policy: the maximum risk attending such a rejection. Unless otherwise stated (and it rarely is), it is taken to equal .05 (part of the Fisherian legacy; Cohen, 1990). Other values may of course be selected. For example, in studies testing several H0s, it is recommended that α = .01 per hypothesis in order that the experimentwise risk (i.e., the risk of any false rejections) not become too large. Also, for tests whose parameters may be either positive or negative, the α risk may be defined as two sided or one sided. The many tables in SPABS provide for both kinds, but the sample sizes provided in this note are all for two-sided tests at α = .01, .05, and .10, the last for circumstances in which a less rigorous standard for rejection is desired, as, for example, in exploratory studies. For unreconstructed one tailers (see Cohen, 1965), the tabled sample sizes provide close approximations for one-sided tests at ½α (e.g., the sample sizes tabled under α = .10 may be used for one-sided tests at α = .05).

Power

The statistical power of a significance test is the long-term probability, given the population ES, α, and N, of rejecting H0. When the ES is not equal to zero, H0 is false, so failure to reject it also incurs an error. This is a Type II error, and for any given ES, α, and N, its probability of occurring is β. Power is thus 1 - β, the probability of rejecting a false H0.

In this treatment, the only specification for power is .80 (so β = .20), a convention proposed for general use. (SPABS provides for 11 levels of power in most of its N tables.) A materially smaller value than .80 would incur too great a risk of a Type II error. A materially larger value would result in a demand for N that is likely to exceed the investigator's resources. Taken with the conventional α = .05, power of .80 results in a β:α ratio of 4:1 (.20 to .05) of the two kinds of risks.
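The two risk calculations in these sections can be made concrete with a short sketch. It assumes that the several tests are statistically independent, which the article does not state; the function names and the choice of five hypotheses are illustrative only.

```python
def experimentwise_risk(alpha_per_test: float, n_tests: int) -> float:
    """Risk of at least one false rejection across n_tests true null hypotheses.

    Assumes the tests are independent (an assumption of this sketch).
    """
    return 1 - (1 - alpha_per_test) ** n_tests


def risk_ratio(power: float, alpha: float) -> float:
    """Ratio of Type II risk (beta = 1 - power) to Type I risk (alpha)."""
    return (1 - power) / alpha


if __name__ == "__main__":
    # Experimentwise risk for five tests at each per-hypothesis alpha.
    for alpha in (0.05, 0.01):
        print(alpha, round(experimentwise_risk(alpha, n_tests=5), 3))
    # beta:alpha ratio for power = .80 and alpha = .05.
    print(round(risk_ratio(power=0.80, alpha=0.05), 2))
```

Under the independence assumption, five tests at α = .05 each carry an experimentwise risk above .20, whereas the same five tests at α = .01 each stay just under .05.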
(See SPABS.)

Sample Size

In research planning, the investigator needs to know the N necessary to attain the desired power for the specified α and hypothesized ES. N increases with an increase in the power desired, a decrease in the ES, and a decrease in α. For statistical tests involving two or more groups, N as here defined is the necessary sample size for each group.

Effect Size

Researchers find specifying the ES the most difficult part of power analysis. As suggested above, the difficulty is at least partly due to the generally low level of consciousness of the magnitude of phenomena that characterizes much of psychology. This in turn may help explain why, despite the stricture of methodologists, significance testing is so heavily preferred to confidence interval estimation, although the wide intervals that usually result may also play a role (Cohen, 1990). However, neither the determination of power nor that of necessary sample size can proceed without the investigator having some idea about the degree to which the H0 is believed to be false (i.e., the ES).

In the Neyman-Pearson method of statistical inference, in addition to the specification of H0, an alternate hypothesis (H1) is counterpoised against H0. The degree to which H0 is false is indexed by the discrepancy between H0 and H1 and is called the ES. Each statistical test has its own ES index. All the indexes are scale free and continuous, ranging upward from zero, and for all, the H0 is that ES = 0. For example, for testing the product-moment correlation of a sample for significance, the ES is simply the population r, so H0 posits that r = 0. As another example, for testing the significance of the departure of a population proportion (P) from .50, the ES index is g = P - .50, so the H0 is that g = 0. For the tests of the significance of the difference between means, correlation coefficients, and proportions, the H0 is that the difference equals zero. Table 1 gives for each of the tests the definition of its ES index.

To convey the meaning of any given ES index, it is necessary to have some idea of its scale.
To this end, I have proposed as conventions or operational definitions small, medium, and large values for each that are at least approximately consistent across the different ES indexes. My intent was that medium ES represent an effect likely to be visible to the naked eye of a careful observer. (It has since been noted in effect-size surveys that it approximates the average size of observed effects in various fields.) I set small ES to be noticeably smaller than medium but not so small as to be trivial, and I set large ES to be the same distance above medium as small was below it. Although the definitions were made subjectively, with some early minor adjustments, these conventions have been fixed since the 1977 edition of SPABS and have come into general use. Table 1 contains these values for the tests considered here.

In the present treatment, the H1s are the ESs that operationally define small, medium, and large effects as given in Table 1. For the test of the significance of a sample r, for example, because the ES for this test is simply the alternate-hypothetical population r, small, medium, and large ESs are respectively .10, .30, and .50. The ES index for the t test of the difference between means is d, the difference expressed in units of (i.e., divided by) the within-population standard deviation. For this test, the H0 is that d = 0 and the small, medium, and large ESs (or H1s) are d = .20, .50, and .80. Thus, an operationally defined medium difference between means is half a standard deviation; concretely, for IQ scores in which the population standard deviation is 15, a medium difference between means is 7.5 IQ points.

Table 1
ES Indexes and Their Values for Small, Medium, and Large Effects

Test                                                ES index                                                      Small  Medium  Large
1. m_A vs. m_B for independent means                $d = (m_A - m_B)/\sigma$                                      .20    .50     .80
2. Significance of product-moment r                 $r$                                                           .10    .30     .50
3. r_A vs. r_B for independent rs                   $q = z_A - z_B$, where $z$ = Fisher's z                       .10    .30     .50
4. P = .5 and the sign test                         $g = P - .50$                                                 .05    .15     .25
5. P_A vs. P_B for independent proportions          $h = \phi_A - \phi_B$, where $\phi$ = arcsine transformation  .20    .50     .80
6. Chi-square for goodness of fit and contingency   $w = \sqrt{\sum_{i=1}^{k} (P_{1i} - P_{0i})^2 / P_{0i}}$      .10    .30     .50
7. One-way analysis of variance                     $f = \sigma_m/\sigma$                                         .10    .25     .40
8. Multiple and multiple partial correlation        $f^2 = R^2/(1 - R^2)$                                         .02    .15     .35

Note. ES = population effect size.

Statistical Tests

The tests covered here are the most common tests used in psychological research:

1. The t test for the difference between two independent means, with df = 2(N - 1).
2. The t test for the significance of a product-moment correlation coefficient r, with df = N - 2.
3. The test for the difference between two independent rs, accomplished as a normal curve test through the Fisher z transformation of r (tabled in many statistical texts).
4. The binomial distribution or, for large samples, the normal curve (or equivalent chi-square, 1 df) test that a population proportion (P) = .50. This test is also used in the nonparametric sign test for differences between paired observations.
5. The normal curve test for the difference between two independent proportions, accomplished through the arcsine transformation φ (tabled in many statistical texts). The results are effectively the same when the test is made using the chi-square test with 1 degree of freedom.
6. The chi-square test for goodness of fit (one way) or association in two-way contingency tables.
   In Table 1, k is the number of cells and $P_{0i}$ and $P_{1i}$ are the null hypothetical and alternate hypothetical population proportions in cell i. (Note that w's structure is the same as chi-square's for cell sample frequencies.) For goodness-of-fit tests, the df = k - 1, and for contingency tables, df = (a - 1)(b - 1), where a and b are the number of levels in the two variables. Table 2 provides (total) sample sizes for 1 through 6 degrees of freedom.
7. One-way analysis of variance. Assuming equal sample sizes (as we do throughout), for g groups, the F test has df = g - 1, g(N - 1). The ES index is the standard deviation of the g population means divided by the common within-population standard deviation. Provision is made in Table 2 for 2 through 7 groups.
8. Multiple and multiple partial correlation. For k independent variables, the significance test is the standard F test for df = k, N - k - 1. The ES index, f², is defined for either squared multiple or squared multiple partial correlations (R²). Table 2 provides for 2 through 8 independent variables.

Note that because all tests of population parameters that can be either positive or negative (Tests 1-5) are two-sided, their ES indexes here are absolute values.

In using the material that follows, keep in mind that the ES posited by the investigator is what he or she believes holds for the population and that the sample size that is found is conditional on the ES. Thus, if a study is planned in which the investigator believes that a population r is of medium size (ES = r = .30 from Table 1) and the t test is to be performed with two-sided α = .05, then the power of this test is .80 if the sample size is 85 (from Table 2). If, using 85 cases, t is not significant, then either r is smaller than .30 or the investigator has been the victim of the .20 (β) risk of making a Type II error.

Table 2
N for Small, Medium, and Large ES at Power = .80 for α = .01, .05, and .10

[Table body not recoverable from this transcription. Rows: 1. Mean dif; 2. Sig r; 3. r dif; 4. P = .5; 5. P dif; 6. χ² with 1 through 6 df; 7. ANOVA with 2 through 7 groups (a); 8. Mult R with 2 through 8 independent variables (b). Columns: Sm, Med, and Lg ES under each α.]

Note. ES = population effect size, Sm = small, Med = medium, Lg = large, dif = difference, ANOVA = analysis of variance. Tests numbered as in Table 1.
(a) Number of groups. (b) Number of independent variables.

Examples

The necessary N for power of .80 for the following examples are found in Table 2.

1. To detect a medium difference between two independent sample means (d = .50 in Table 1) at α = .05 requires N = 64 in each group. (A d of .50 is equivalent to a point-biserial correlation of .243; see SPABS.)
2. For a significance test of a sample r at α = .01, when the population r is large (.50 in Table 1), a sample size = 41 is required. At α = .05, the necessary sample size = 28.
3. To detect a medium-sized difference between two population rs (q = .30 in Table 1) at α = .05 requires N = 177 in each group. (The following pairs of rs yield q = .30: .00, .29; .20, .46; .40, .62; .60, .76; .80, .89; .90, .94; see SPABS.)
4. The sign test tests the H0 that .50 of a population of paired differences are positive. If the population proportion's departure from .50 is medium (g = .15 in Table 1), at α = .10, the necessary N = 67; at α = .05, it is 85.
5. To detect a small difference between two population proportions (h = .20 in Table 1) at α = .05 requires N = 392 cases in each group. (The following pairs of Ps yield approximate values of h = .20: .05, .10; .20, .29; .40, .50; .60, .70; .80, .87; .90, .95; see SPABS, p. 184f.)
6. A 3 × 4 contingency table has 6 degrees of freedom. To detect a medium degree of association in the population (w = .30 in Table 1) at α = .05 requires N = 151. (w = .30 corresponds to a contingency coefficient of .287, and for 6 degrees of freedom, a Cramér φ of .212; see SPABS.)
7. A psychologist considers alternate research plans involving comparisons of the means of either three or four groups in both of which she believes that the ES is medium (f = .25 in Table 1). She finds that at α = .05, the necessary sample size per group is 52 cases for the three-group plan and 45 cases for the four-group plan, thus, total sample sizes of 156 and 180. (When f = .25, the proportion of variance accounted for by group membership is .0588; see SPABS.)
8. A psychologist plans a research in which he will do a multiple regression/correlation analysis and perform all the significance tests at α = .01. For the F test of the multiple R², he expects a medium ES, that is, f² = .15 (from Table 1). He has a candidate set of eight independent variables for which Table 2 indicates that the required sample size is 147, which exceeds his resources. However, from his knowledge of the research area, he believes that the information in the eight variables can be effectively summarized in three. For three variables, the necessary sample size is only 108. (Given the relationship between f² and R², the values for small, medium, and large R² are respectively .0196, .1304, and .2592, and for R, .14, .36, and .51; see SPABS.)

References

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65.
Cohen, J. (1965). Some statistical issues in psychological research. In B. B. Wolman (Ed.), Handbook of clinical psychology. New York: McGraw-Hill.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. San Diego, CA: Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist.
Goldstein, R. (1989). Power and sample size via MS/PC-DOS computers. American Statistician, 43, 253-260.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20A.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Transactions of the Royal Society of London Series A, 231.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105.

Received February 1, 1991
Revision received April 1991
Accepted May 2, 1991
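The effect-size definitions in Table 1 and the conversions quoted parenthetically in the Examples can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the arcsine transformation is taken to be φ = 2·arcsin(√P), and the point-biserial formula assumes two groups of equal size.

```python
from math import asin, atanh, sqrt


def cohen_q(r_a: float, r_b: float) -> float:
    """q: absolute difference between Fisher z transforms of two rs."""
    return abs(atanh(r_a) - atanh(r_b))


def cohen_h(p_a: float, p_b: float) -> float:
    """h: absolute difference between arcsine-transformed proportions."""
    def phi(p: float) -> float:
        return 2 * asin(sqrt(p))
    return abs(phi(p_a) - phi(p_b))


def f2_to_r2(f2: float) -> float:
    """Invert f^2 = R^2 / (1 - R^2) to recover R^2."""
    return f2 / (1 + f2)


def f_to_variance_proportion(f: float) -> float:
    """Proportion of variance accounted for by group membership, given f."""
    return f ** 2 / (1 + f ** 2)


def d_to_point_biserial(d: float) -> float:
    """Point-biserial r equivalent to d, assuming equal group sizes."""
    return d / sqrt(d ** 2 + 4)


def w_to_contingency_coefficient(w: float) -> float:
    """Contingency coefficient corresponding to w."""
    return sqrt(w ** 2 / (1 + w ** 2))


if __name__ == "__main__":
    print(round(d_to_point_biserial(0.50), 3))           # Example 1
    print(round(cohen_q(0.00, 0.29), 2))                 # Example 3
    print(round(cohen_h(0.40, 0.50), 2))                 # Example 5
    print(round(w_to_contingency_coefficient(0.30), 3))  # Example 6
    print(round(f_to_variance_proportion(0.25), 4))      # Example 7
    print(round(f2_to_r2(0.15), 4))                      # Example 8
```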