QUANTITATIVE METHODS IN PSYCHOLOGY

A Power Primer

Jacob Cohen
New York University

One possible reason for the continued neglect of statistical power analysis in research in the behavioral sciences is the inaccessibility of or difficulty with the standard material. A convenient, although not comprehensive, presentation of required sample sizes is provided here. Effect-size indexes and conventional values for these are given for operationally defined small, medium, and large effects. The sample sizes necessary for .80 power to detect effects at these levels are tabled for eight standard statistical tests: (a) the difference between means, (b) the significance of a product-moment correlation, (c) the difference between rs, (d) the sign test, (e) the difference between proportions, (f) chi-square tests for goodness of fit and contingency tables, (g) one-way analysis of variance, and (h) the significance of a multiple or multiple partial correlation.

The preface to the first edition of my power handbook (Cohen, 1969) begins:

During my first dozen years of teaching and consulting on applied statistics with behavioral scientists, I became increasingly impressed with the importance of statistical power analysis, an importance which was increased an order of magnitude by its neglect in our textbooks and curricula. The case for its importance is easily made: What behavioral scientist would view with equanimity the question of the probability that his investigation would lead to statistically significant results, i.e., its power? (p. vii)

This neglect was obvious through casual observation and had been confirmed by a power review of the 1960 volume of the Journal of Abnormal and Social Psychology, which found the mean power to detect medium effect sizes to be .48 (Cohen, 1962). Thus, the chance of obtaining a significant result was about that of tossing a head with a fair coin. I attributed this disregard of power to the inaccessibility of a meager and mathematically difficult literature, beginning with its origin in the work of Neyman and Pearson (1928, 1933).

The power handbook was supposed to solve the problem.
It required no more background than an introductory psychological statistics course that included significance testing. The exposition was verbal-intuitive and carried largely by many worked examples drawn from across the spectrum of behavioral science.

In the ensuing two decades, the book has been through revised (1977) and second (1988) editions and has inspired dozens of power and effect-size surveys in many areas of the social and life sciences (Cohen, 1988, pp. xi-xii). During this period, there has been a spate of articles on power analysis in the social science literature, a baker's dozen of computer programs (reviewed in Goldstein, 1989), and a breakthrough into popular statistics textbooks (Cohen, 1988, pp. xii-xiii).

I am grateful to Patricia Cohen for her useful comments. Correspondence concerning this article should be addressed to Jacob Cohen, Department of Psychology, New York University, 6 Washington Place, 5th Floor, New York, New York.

Sedlmeier and Gigerenzer (1989) reported a power review of the 1984 volume of the Journal of Abnormal Psychology (some 24 years after mine) under the title, "Do Studies of Statistical Power Have an Effect on the Power of Studies?" The answer was no. Neither their study nor the dozen other power reviews they cite (excepting those fields in which large sample sizes are used, e.g., sociology, market research) showed any material improvement in power. Thus, a quarter century has brought no increase in the probability of obtaining a significant result.

Why is this? There is no controversy among methodologists about the importance of power analysis, and there are ample accessible resources for estimating sample sizes in research planning using power analysis. My 2-decades-long expectation that methods sections in research articles in psychological journals would invariably include power analyses has not been realized. Indeed, they almost invariably do not. Of the 54 articles Sedlmeier and Gigerenzer (1989) reviewed, only 2 mentioned power, and none estimated power or necessary sample size or the population effect size they posited.
In 7 of the studies, null hypotheses served as research hypotheses that were confirmed when the results were nonsignificant. Assuming a medium effect size, the median power for these tests was .25! Thus, these authors concluded that their research hypotheses of no effect were supported when they had only a .25 chance of rejecting these null hypotheses in the presence of substantial population effects.

It is not at all clear why researchers continue to ignore power analysis. The passive acceptance of this state of affairs by editors and reviewers is even more of a mystery. At least part of the reason may be the low level of consciousness about effect size: It is as if the only concern about magnitude in much psychological research is with regard to the statistical test result and its accompanying p value, not with regard to the psychological phenomenon under study. Sedlmeier and Gigerenzer (1989) attribute this to the accident of the historical precedence of Fisherian theory, its hybridization with the contradictory Neyman-Pearson theory, and the apparent completeness of Fisherian null hypothesis testing: objective, mechanical, and a clearcut go-no-go decision straddled over p = .05. I have suggested that the neglect of power analysis simply exemplifies the slow movement of methodological advance (Cohen, 1988, p. xiv), noting that it took some 40 years from Student's publication of the t test to its inclusion in psychological statistics textbooks (Cohen, 1990, p. 1311).

An associate editor of this journal suggests another reason: Researchers find too complicated, or do not have at hand, either my book or other reference material for power analysis. He suggests that a short rule-of-thumb treatment of necessary sample size might make a difference. Hence this article.

In this bare bones treatment, I cover only the simplest cases, the most common designs and tests, and only three levels of effect size. For readers who find this inadequate, I unhesitatingly recommend Statistical Power Analysis for the Behavioral Sciences (Cohen, 1988; hereafter SPABS). It covers special cases, one-sided tests, unequal sample sizes, other null hypotheses, set correlation and multivariate methods and gives substantive examples of small, medium, and large effect sizes for the various tests. It offers well over 100 worked illustrative examples and is as user friendly as I know how to make it, the technical material being relegated to an appendix.

Psychological Bulletin, 1992, No. 1. Copyright 1992 by the American Psychological Association, Inc.

Method

Statistical power analysis exploits the relationships among the four variables involved in statistical inference: sample size (N), significance criterion (α), population effect size (ES), and statistical power. For any statistical model, these relationships are such that each is a function of the other three. For example, in power reviews, for any given statistical test, we can determine power for given α, N, and ES. For research planning, however, it is most useful to determine the N necessary to have a specified power for given α and ES; this article addresses this use.
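The interdependence of N, α, ES, and power can be sketched numerically. What follows is a minimal illustration and not the procedure used to build the tables in this article: it assumes a comparison of two group means with equal group sizes, takes the ES to be the standardized mean difference d, and substitutes a large-sample normal approximation for the t distribution. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power for comparing two group means.

    Normal approximation (an assumption of this sketch): the test statistic
    is treated as normal with mean d * sqrt(n / 2) and unit variance.
    """
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    shift = abs(d) * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)


def approx_n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Smallest per-group n whose approximate power reaches the target."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    # Closed-form starting value that ignores the far rejection tail.
    n = max(2, ceil(2 * ((z_crit + z_power) / abs(d)) ** 2))
    # Step down while the target is still met, then up if it is not.
    while n > 2 and approx_power(d, n - 1, alpha) >= power:
        n -= 1
    while approx_power(d, n, alpha) < power:
        n += 1
    return n


if __name__ == "__main__":
    for d in (0.8, 0.5, 0.2):
        n = approx_n_per_group(d)
        print(f"d = {d}: n per group ~ {n}, power ~ {approx_power(d, n):.3f}")
```

Under these assumptions the sketch returns 63 cases per group for d = .50 at α = .05, slightly below the tabled value of 64, because the normal approximation is a little optimistic relative to a t-based calculation. Fixing any three of the four quantities determines the fourth, which is the relationship the tables exploit.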
The Significance Criterion, α

The risk of mistakenly rejecting the null hypothesis (H₀) and thus of committing a Type I error, α, represents a policy: the maximum risk attending such a rejection. Unless otherwise stated (and it rarely is), it is taken to equal .05 (part of the Fisherian legacy; Cohen, 1990). Other values may of course be selected. For example, in studies testing several H₀s, it is recommended that α = .01 per hypothesis in order that the experimentwise risk (i.e., the risk of any false rejections) not become too large. Also, for tests whose parameters may be either positive or negative, the α risk may be defined as two sided or one sided. The many tables in SPABS provide for both kinds, but the sample sizes provided in this note are all for two-sided tests at α = .01, .05, and .10, the last for circumstances in which a less rigorous standard for rejection is desired, as, for example, in exploratory studies. For unreconstructed one tailers (see Cohen, 1965), the tabled sample sizes provide close approximations for one-sided tests at ½α (e.g., the sample sizes tabled under α = .10 may be used for one-sided tests at α = .05).

Power

The statistical power of a significance test is the long-term probability, given the population ES, α, and N of rejecting H₀. When the ES is not equal to zero, H₀ is false, so failure to reject it also incurs an error. This is a Type II error, and for any given ES, α, and N, its probability of occurring is β. Power is thus 1 − β, the probability of rejecting a false H₀.

In this treatment, the only specification for power is .80 (so β = .20), a convention proposed for general use. (SPABS provides for 11 levels of power in most of its N tables.) A materially smaller value than .80 would incur too great a risk of a Type II error. A materially larger value would result in a demand for N that is likely to exceed the investigator's resources. Taken with the conventional α = .05, power of .80 results in a β:α ratio of 4:1 (.20 to .05) of the two kinds of risks.
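The recommendation in the section on the significance criterion to use α = .01 per hypothesis when several H₀s are tested can be illustrated with a short sketch. It assumes, purely for illustration, that the k tests are independent and that every H₀ is true, in which case the experimentwise risk is 1 − (1 − α)^k. The function name is hypothetical.

```python
def experimentwise_risk(alpha: float, k: int) -> float:
    """Probability of at least one false rejection among k tests.

    Assumes (for this sketch only) that the k tests are independent
    and that every null hypothesis is true.
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 1 - (1 - alpha) ** k


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        for k in (1, 5, 10):
            risk = experimentwise_risk(alpha, k)
            print(f"alpha = {alpha}, k = {k}: experimentwise risk ~ {risk:.3f}")
```

With five such tests, the risk of at least one false rejection is roughly .23 at α = .05 per test but stays near .05 at α = .01 per test. Correlated tests would behave differently, so the figures are indicative only.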
(See SPABS.)

Sample Size

In research planning, the investigator needs to know the N necessary to attain the desired power for the specified α and hypothesized ES. N increases with an increase in the power desired, a decrease in the ES, and a decrease in α. For statistical tests involving two or more groups, N as here defined is the necessary sample size for each group.

Effect Size

Researchers find specifying the ES the most difficult part of power analysis. As suggested above, the difficulty is at least partly due to the generally low level of consciousness of the magnitude of phenomena that characterizes much of psychology. This in turn may help explain why, despite the stricture of methodologists, significance testing is so heavily preferred to confidence interval estimation, although the wide intervals that usually result may also play a role (Cohen, 1990). However, neither the determination of power nor that of necessary sample size can proceed without the investigator having some idea about the degree to which the H₀ is believed to be false (i.e., the ES).

In the Neyman-Pearson method of statistical inference, in addition to the specification of H₀, an alternate hypothesis (H₁) is counterpoised against H₀. The degree to which H₀ is false is indexed by the discrepancy between H₀ and H₁ and is called the ES. Each statistical test has its own ES index. All the indexes are scale free and continuous, ranging upward from zero, and for all, the H₀ is that ES = 0. For example, for testing the product-moment correlation of a sample for significance, the ES is simply the population r, so H₀ posits that r = 0. As another example, for testing the significance of the departure of a population proportion (P) from .50, the ES index is g = P − .50, so the H₀ is that g = 0. For the tests of the significance of the difference between means, correlation coefficients, and proportions, the H₀ is that the difference equals zero. Table 1 gives for each of the tests the definition of its ES index.

To convey the meaning of any given ES index, it is necessary to have some idea of its scale.
To this end, I have proposed as conventions or operational definitions small, medium, and large values for each that are at least approximately consistent across the different ES indexes. My intent was that medium ES represent an effect likely to be visible to the naked eye of a careful observer. (It has since been noted in effect-size surveys that it approximates the average size of observed effects in various fields.) I set small ES to be noticeably smaller than medium but not so small as to be trivial, and I set large ES to be the same distance above medium as small was below it. Although the definitions were made subjectively, with some early minor adjustments, these conventions have been fixed since the 1977 edition of SPABS and have come into general use. Table 1 contains these values for the tests considered here.

In the present treatment, the H₁s are the ESs that operationally define small, medium, and large effects as given in Table 1. For the test of the significance of a sample r, for example, because the ES for this test is simply the alternate-hypothetical population r, small, medium, and large ESs are respectively, .10, .30, and .50. The ES index for the t test of the difference between means is d, the difference
expressed in units of (i.e., divided by) the within-population standard deviation. For this test, the H₀ is that d = 0 and the small, medium, and large ESs (or H₁s) are d = .20, .50, and .80. Thus, an operationally defined medium difference between means is half a standard deviation; concretely, for IQ scores in which the population standard deviation is 15, a medium difference between means is 7.5 IQ points.

Table 1
ES Indexes and Their Values for Small, Medium, and Large Effects

Test                                                 ES index                                                      Small  Medium  Large
1. mA vs. mB for independent means                   $d = (m_A - m_B)/\sigma$                                      .20    .50     .80
2. Significance of product-moment r                  $r$                                                           .10    .30     .50
3. rA vs. rB for independent rs                      $q = z_A - z_B$, where $z$ = Fisher's z                       .10    .30     .50
4. P = .5 and the sign test                          $g = P - .50$                                                 .05    .15     .25
5. PA vs. PB for independent proportions             $h = \phi_A - \phi_B$, where $\phi$ = arcsine transformation  .20    .50     .80
6. Chi-square for goodness of fit and contingency    $w = \sqrt{\sum_{i=1}^{k} (P_{1i} - P_{0i})^2 / P_{0i}}$      .10    .30     .50
7. One-way analysis of variance                      $f = \sigma_m/\sigma$                                         .10    .25     .40
8. Multiple and multiple partial correlation         $f^2 = R^2/(1 - R^2)$                                         .02    .15     .35

Note. ES = population effect size.

Statistical Tests

The tests covered here are the most common tests used in psychological research:

1. The t test for the difference between two means, with df = 2(N − 1).
2. The t test for the significance of a product-moment correlation coefficient r, with df = N − 2.
3. The test for the difference between two rs, accomplished as a normal curve test through the Fisher z transformation of r (tabled in many statistical texts).
4. The binomial distribution or, for large samples, the normal curve (or equivalent chi-square, 1 df) test that a population proportion (P) = .50. This test is also used in the nonparametric sign test for differences between paired observations.
5. The normal curve test for the difference between two proportions, accomplished through the arcsine transformation φ (tabled in many statistical texts). The results are effectively the same when the test is made using the chi-square test with 1 degree of freedom.
6. The chi-square test for goodness of fit (one way) or association in two-way contingency tables.
In Table 1, k is the number of cells and P₀ᵢ and P₁ᵢ are the null hypothetical and alternate hypothetical population proportions in cell i. (Note that w's structure is the same as chi-square's for cell sample frequencies.) For goodness-of-fit tests, the df = k − 1, and for contingency tables, df = (a − 1)(b − 1), where a and b are the number of levels in the two variables. Table 2 provides (total) sample sizes for 1 through 6 degrees of freedom.
7. One-way analysis of variance. Assuming equal sample sizes (as we do throughout), for g groups, the F test has df = g − 1, g(N − 1). The ES index is the standard deviation of the g population means divided by the common within-population standard deviation. Provision is made in Table 2 for 2 through 7 groups.
8. Multiple and multiple partial correlation. For k variables, the significance test is the standard F test for df = k, N − k − 1. The ES index, f², is defined for either squared multiple or squared multiple partial correlations (R²). Table 2 provides for 2 through 8 variables.

Note that because all tests of population parameters that can be either positive or negative (Tests 1-5) are two-sided, their ES indexes here are absolute values.

In using the material that follows, keep in mind that the ES posited by the investigator is what he or she believes holds for the population and that the sample size that is found is conditional on the ES. Thus, if a study is planned in which the investigator believes that a population r is of medium size (ES = r = .30 from Table 1) and the t test is to be performed with two-sided α = .05, then the power of this test is .80 if the sample size is 85 (from Table 2). If, using 85 cases, t is not significant, then
either r is smaller than .30 or the investigator has been the victim of the .20 (β) risk of making a Type II error.

Table 2
N for Small, Medium, and Large ES at Power = .80 for α = .01, .05, and .10

Rows: 1. Mean dif; 2. Sig r; 3. r dif; 4. P = .5; 5. P dif; 6. χ² (1 df through 6 df); 7. ANOVA (2gᵃ through 7gᵃ); 8. Mult R (2kᵇ through 8kᵇ). Columns: Sm, Med, and Lg at each α.
[Tabled N values are not legible in this copy.]

Note. ES = population effect size, Sm = small, Med = medium, Lg = large, diff = difference, ANOVA = analysis of variance. Tests numbered as in Table 1.
ᵃ Number of groups. ᵇ Number of variables.

Examples

The necessary N for power of .80 for the following examples are found in Table 2.

1. To detect a medium difference between two sample means (d = .50 in Table 1) at α = .05 requires N = 64 in each group. (A d of .50 is equivalent to a point-biserial correlation of .243; see SPABS.)
2. For a significance test of a sample r at α = .01, when the population r is large (.50 in Table 2), a sample size = 41 is required. At α = .05, the necessary sample size = 28.
3. To detect a medium-sized difference between two population rs (q = .30 in Table 1) at α = .05 requires N = 177 in each group. (The following pairs of rs yield q = .30: .00, .29; .20, .46; .40, .62; .60, .76; .80, .89; .90, .94; see SPABS.)
4. The sign test tests the H₀ that .50 of a population of paired differences are positive. If the population proportion's departure from .50 is medium (g = .15 in Table 1), at α = .10, the necessary N = 67; at α = .05, it is 85.
5. To detect a small difference between two population proportions (h = .20 in Table 1) at α = .05 requires N = 392 cases in each group. (The following pairs of Ps yield approximate values of h = .20: .05, .10; .20, .29; .40, .50; .60, .70; .80, .87; .90, .95; see SPABS, p. 184f.)
6. A 3 × 4 contingency table has 6 degrees of freedom. To detect a medium degree of association in the population (w = .30 in Table 1) at α = .05 requires N = 151. (w = .30 corresponds to a contingency coefficient of .287, and for 6 degrees of freedom, a Cramér φ of .212; see SPABS.)
7. A psychologist considers alternate research plans involving comparisons of the means of either three or four groups in both of which she believes that the ES is medium (f = .25 in Table 1). She finds that at α = .05, the necessary sample size per group is 52 cases for the three-group plan and 45 cases for the four-group plan, thus, total sample sizes of 156 and 180. (When f = .25, the proportion of variance accounted for by group membership is .0588; see SPABS.)
8. A psychologist plans research in which he will do a multiple regression/correlation analysis and perform all the significance tests at α = .01. For the F test of the multiple R², he expects a medium ES, that is, f² = .15 (from Table 1). He has a candidate set of eight variables for which Table 2 indicates that the required sample size is 147, which exceeds his resources. However, from his knowledge of the research area, he believes that the information in the eight variables can be
References

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65.
Cohen, J. (1965). Some statistical issues in psychological research. In B. B. Wolman (Ed.), Handbook of clinical psychology. New York: McGraw-Hill.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. San Diego, CA: Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1990). Things I have learned (so far). American Psychologist.
Goldstein, R. (1989). Power and sample size via MS/PC-DOS computers. American Statistician, 43.
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inference. Biometrika, A.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Transactions of the Royal Society of London Series A, 231.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105.

Received February 1, 1991
Revision received April 1991
Accepted May 2, 1991