The Unicorn, The Normal Curve, and Other Improbable Creatures

Transcription

Psychological Bulletin 1989, Vol No.1

Theodore Micceri 1
Department of Educational Leadership
University of South Florida

An investigation of the distributional characteristics of 440 large-sample achievement and psychometric measures found all to be significantly nonnormal at the alpha .01 significance level. Several classes of contamination were found, including tail weights from the uniform to the double exponential, exponential-level asymmetry, severe digit preferences, multimodalities, and modes external to the mean/median interval. Thus, the underlying tenets of normality-assuming statistics appear fallacious for these commonly used types of data. However, findings here also fail to support the types of distributions used in most prior robustness research suggesting the failure of such statistics under nonnormal conditions. A reevaluation of the statistical robustness literature appears appropriate in light of these findings.

1 During recent years a considerable literature devoted to robust statistics has appeared. This research reflects a growing concern among statisticians regarding the robustness, or insensitivity, of parametric statistics to violations of their underlying assumptions. Recent findings suggest that the most commonly used of these statistics exhibit varying degrees of nonrobustness to certain violations of the normality assumption. Although the importance of such findings is underscored by numerous empirical studies documenting nonnormality in a variety of fields, a startling lack of such evidence exists for achievement tests and psychometric measures. A naive assumption of normality appears to characterize research involving these discrete, bounded, measures. In fact, some contend that given the developmental process used to produce such measures, "a bell shaped distribution is guaranteed" (Walberg, Strykowski, Rovai, & Hung, 1984, p. 107).
This inquiry sought to end the tedious arguments regarding the prevalence of normal-like distributions by surveying a large number of real-world achievement and psychometric distributions to determine what distributional characteristics actually occur.

2 Widespread belief in normality evolved quite naturally within the dominant reductionist religio-philosophy of the 19th century. Early statistical researchers such as Gauss sought some measure to estimate the center of a sample. Hampel (1973) stated,

Gauss... introduced the normal distribution to suit the arithmetic mean... and... developed his statistical theories mainly under the criterion of mathematical simplicity and elegance. (p. 94)

1. The author holds a joint appointment with the Department of Educational Leadership, College of Education, University of South Florida, and with the Assistant Dean's Office, College of Engineering, Center for Interactive Technologies, Applications, and Research. More complete tables are available from the author for postage and handling costs. Correspondence concerning this article should be addressed to Theodore Micceri, Department of Educational Leadership, University of South Florida, FAO 296, Tampa, Florida

3 Certain later scientists, seduced by such elegance, may have spent too much time seeking worldly manifestations of God:

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the Law of Frequency of Error. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. (Galton, 1889, p. 66)

4 Although Galton himself recognized the preceding to hold only for homogeneous populations (Stigler, 1986), such attributions to deity continue to appear in educational and psychological statistics texts:

It is a fortunate coincidence that the measurements of many variables in all disciplines have distributions that are good approximations of the normal distribution. Stated differently, God loves the normal curve! (Hopkins & Glass, 1978, p. 95)

5 Toward the end of the 19th century, biometricians such as Karl Pearson (1895) raised questions about the prevalence of normality among real-world distributions. Distrust of normality increased shortly thereafter when Gosset's (Student, 1908) development of the t test, with its strong assumptions, made statisticians of that time "almost over-conscious of universal non-normality" (Geary, 1947, p. 241). During the 1920s, however, an important change of attitude occurred

following on the brilliant work of R. A. Fisher who showed that, when universal normality could be assumed, inferences of the widest practical usefulness could be drawn from samples of any size. Prejudice in favour of normality returned in full force... and the importance of the underlying assumptions was almost forgotten. (Geary, 1947, p. 241)

6 The preceding illustrates both trends in attitudes toward normality and the influence of R. A. Fisher on 20th-century scientists. Today's literature suggests a trend toward distrust of normality; however, this attitude frequently bypasses psychometricians and educators.
Interestingly, the characteristics of their measures provide little support for the expectation of normality because they consist of a number of discrete data points and [page 157] because their distributions are almost exclusively multinomial in nature. For multinomial distributions, each possible score (sample point) is itself a variable, and correlations may exist among each variable score/sample point. Thus, an extremely large number of possible cumulative distribution functions (cdfs) exist for such distributions defined by the probability of the occurrence for each score/sample point (Hastings & Peacock, 1975, p. 90). The expectation that a single cdf (i.e., Gaussian) characterizes most score distributions for such measures appears unreasonable for several reasons. Nunnally (1978, p. 160) identifies an obvious one: "Strictly speaking, test scores are seldom normally distributed." The items of a test must correlate positively with one another for the measurement method to make sense. "Average correlations as high as .40 would tend to produce a distribution that was markedly flatter than the normal" (Nunnally, 1978, p. 160). Other factors that might contribute to a non-Gaussian error distribution in the population of interest include but are not limited to (a) the existence of undefined subpopulations within a target population having different abilities or attitudes, (b) ceiling or floor effects, (c) variability in the difficulty of items within a measure, and (d) treatment effects that change not only the location parameter and variability but also the shape of a distribution.

7 Of course, this issue is unimportant if statistics are truly robust; however, considerable research suggests that parametric statistics frequently exhibit either relative or absolute nonrobustness in the presence of certain nonnormal distributions. The arithmetic mean has not proven relatively robust in a variety of situations: Andrews et al. (1972), Ansell (1973), Gastwirth and Rubin (1975), Wegman and Carroll (1977), Stigler (1977), David and Shu (1978), and Hill and Dixon (1982).
The standard deviation, as an estimate of scale, proves relatively inefficient given only 18/100 of 1% contamination (Hampel, 1973). Others who found the standard deviation relatively nonrobust include Tukey and McLaughlin (1963), Wainer and Thissen (1976), and Hettmansperger and McKean (1978). Kowalski (1972) recommends against using the Pearson product moment coefficient unless (X, Y) is very nearly normal because of both nonrobustness and interpretability. Wainer and Thissen (1976) contend that nothing would be lost by immediately switching to a robust alternative, r_t.

8 A large, complex literature on the robustness of parametric inferential procedures suggests that with the exception of the one-mean t or z tests and the random-effects analysis of variance (ANOVA), parametric statistics exhibit robustness or conservatism with regard to alpha in a variety of nonnormal conditions given large and equal sample sizes. Disagreement exists regarding the meaning of large in this context (Bradley, 1980). Also, several reviews suggest that when ns are unequal or samples are small, this robustness disappears in varying situations (Blair, 1981; Ito, 1980; Tan, 1982). In addition, robustness of efficiency (power or beta) studies suggest that competitive tests such as the Wilcoxon rank-sum exhibit considerable power advantages while retaining equivalent robustness of alpha in a variety of situations (Blair, 1981; Tan, 1982).

9 Although far from conclusive, the preceding indicate that normality-assuming statistics may be relatively nonrobust in the presence of non-Gaussian distributions. In addition, any number of works asserting the nonnormality of specific distributions and thereby the possible imprecision of statistical procedures dependent on this assumption may be cited (Allport, 1934; Andrews et al., 1972; Bradley, 1977, 1982; Hampel, 1973; E. S. Pearson & Please, 1975; K. Pearson, 1895; Simon, 1955; Stigler, 1977; Tan, 1982; Tapia & Thompson, 1978; Tukey & McLaughlin, 1963; Wilson & Hilferty, 1929). Despite this, the normality assumption continues to permeate both textbooks and the research literature of the social and behavioral sciences.

10 The implications of the preceding discussion are difficult to assess because little of the noted robustness research deals with real-world data.
The complexity and lack of availability of real-world data compels many researchers to simplify questions by retreating into either asymptotic theory or Monte Carlo investigations of interesting mathematical functions. The eminent statistical historian Stephen Stigler (1977), investigating 18th-century empirical distributions, contended, "the present study may be the first evaluation of modern robust estimators to rely on real data" (p. 1070). Those few researchers venturesome enough to deal with real data (Hill & Dixon, 1982; Stigler, 1977; Tapia & Thompson, 1978) report findings that may call much of the above-cited robustness literature into question: (a) Real data evidence different characteristics than do simulated data; (b) statistics exhibit different properties under real-world conditions than they do in simulated environments; and (c) causal elements for parametric nonrobustness tend to differ from those suggested by theoretical and simulated research.

11 In an attempt to provide an empirical base from which robustness studies may be related to the real world and about which statistical development may evolve, the current inquiry surveyed specific empirical distributions generated in applied settings to determine which, if any, distributional characteristics typify such measures. This research was limited to measures generally avoided in the past, that is, those based on human responses to questions either testing knowledge (ability/achievement) or inventorying perceptions and opinions (psychometric).

12 The obvious approach to classifying distributions, à la K. Pearson (1895), Simon (1955), Taillie, Patil, and Baldessari (1981), and Law and Vincent (1983), is to define functionals characterizing actual score distributions. Unfortunately, this approach confronts problems when faced with the intractable data of empiricism.
Tapia and Thompson (1978) in their discussion of the Pearson system of curves contend that even after going through the strenuous process of determining which of the six Pearson curves a distribution appears to fit, one cannot be sure either that the chosen curve is correct or that the distribution itself is actually a member of the Pearson family. They suggest that one might just as well estimate the density function itself. Such a task, although feasible, is both complex and uncertain. Problems of identifiability exist for mixed distributions (Blischke, 1978; Quandt & Ramsey, 1978; Taillie et al., 1981), in which the specification of different parameter values can result in identical mixed distributions, even for mathematically tractable two-parameter distributions such as the Gaussian. Kempthorne (1978) argues that

almost all distributional problems are insoluble with a discrete sample space, notwithstanding the fact that elementary texts are replete with finite space problems that are soluble. (p. 12)

13 [page 158] No attempt is made here to solve the insoluble. Rather, this inquiry attempted, as suggested by Stigler (1977), to determine the degree and frequency with which various forms of contamination (e.g., heavy tails or extreme asymmetry) occur among real data. Even the comparatively simple process of classifying empirical distributions using only symmetry and tail weight has pitfalls. Elashoff and Elashoff (1978), discussing estimates of tail weight, note that "no single parameter can summarize the varied meanings of tail length" (p. 231). The same is true for symmetry or the lack of it (Gastwirth, 1971; Hill & Dixon, 1982). Therefore, multiple measures of both tail weight and asymmetry were used to classify distributions.

14 As robust measures of tail weight, Q statistics (ratios of outer means) and C statistics (ratios of outer percentile points) receive support. Hill and Dixon (1982), Elashoff and Elashoff (1978), Wegman and Carroll (1977), and Hogg (1974) discuss the Q statistics, and Wilson and Hilferty (1929), Mosteller and Tukey (1978), and Elashoff and Elashoff (1978) discuss the C statistics.

15 As a robust measure of asymmetry, Hill and Dixon (1982) recommend Hogg's (1974) Q2. However, Q2 depends on contamination in the tails of distributions and is not sensitive to asymmetry occurring only between the 75th and 95th percentiles. An alternative suggested by Gastwirth (1971) is a standardized value of the population mean/median interval. In the symmetric case, as sample size increases, the statistic should approach zero.
In the asymmetric case, as sample size increases, the statistic will tend to converge toward a value indicating the degree of asymmetry in a distribution.

Method

16 Two problems in obtaining a reasonably representative sample of psychometric and achievement/ability measures are (a) lack of availability and (b) small sample sizes. Samples of 400 or greater were sought to provide reasonably stable estimates of distributional characteristics. Distributions, by necessity, were obtained on an availability basis. Requests were made of 15 major test publishers, the University of South Florida's institutional research department, the Florida Department of Education, and several Florida school districts for ability score distributions in excess of 400 cases. In addition, requests were sent to the authors of every article citing the use of an ability or psychometric measure on more than 400 individuals between the years 1982 and 1984 in Applied Psychology, Journal of Research in Personality, Journal of Personality, Journal of Personality Assessment, Multivariate Behavioral Research, Perceptual and Motor Skills, Applied Psychological Measurement, Journal of Experimental Education, Journal of Educational Psychology, Journal of Educational Research, and Personnel Psychology. A total of over 500 score distributions were obtained, but because many were different applications of the same measure, only 440 were submitted to analysis.

17 Four types of measures were sampled separately: general achievement/ability tests, criterion/mastery tests, psychometric measures, and, where available, gain scores (the difference between a pre- and postmeasure).

18 For each distribution, three measures of symmetry/asymmetry were computed: (a) M/M intervals (Hill and Dixon, 1982), defined as the mean/median interval divided by a robust scale estimate ( multiplied by one-half the interquartile range), (b) skewness, and (c) Hogg's (1974) Q2, where

Q2 = [U(05) - M(25)] / [M(25) - L(05)]

where U(alpha) [M(alpha), L(alpha)] is the mean of the upper (middle, lower) [(N + 1)alpha] observations. The inverse of this ratio defines Q2 for the lower tail.

19 Two different types of tail weight measure were also computed: (a) Hogg's (1974) Q and Q1, where

Q = [U(05) - L(05)] / [U(50) - L(50)]
Q1 = [U(20) - L(20)] / [U(50) - L(50)]

and (b) C ratios of Elashoff and Elashoff (1978): C90, C95, and C97.5 (the ratio of the 90th, 95th, and 97.5th percentile points, respectively, to the 75th percentile point). 1 The Q statistics are sensitive to relative density and the C statistics to distance (between percentiles). Kurtosis, although computed, was not used for classification because of interpretability problems.

20 Criterion values of contamination were determined for these measures using tabled values for symmetric distributions (Elashoff & Elashoff, 1978) and simulated values for asymmetric distributions. Table 1 shows five cut points defining six levels of tail weight (uniform to double exponential) and three cut points defining four levels of symmetry or asymmetry (relatively symmetric to exponential).

Table 1. Criterion Values for Measures of Tail Weight and Symmetry
Columns: Distribution; Tail weight (C97.5, C95, C90, Q, Q1); Symmetry/asymmetry (Skewness, m/md, Q2)
Rows, expected values: Uniform; Gaussian; Double exponential
Rows, cut points: Uniform; Below Gaussian; Moderate contamination; Extreme contamination; Double exponential
[Cell values not preserved.]

21 Cut points were set arbitrarily, and those defining moderate contamination of either tail weight or asymmetry were selected only to identify distributions as definitely non-Gaussian. The moderate contamination cut points (both symmetric and asymmetric) were set at 5% and 15% contamination on the basis of the support for the alpha trimmed mean and trimmed t in the research literature. Moderate contamination (5%, 2 sd) represents at least twice the expected observations more than 2 standard deviations from the mean, and extreme contamination (15%, 3 sd) represents more than 100 times the expected observations over 3 standard deviations from the mean.
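The Q and C definitions above can be sketched in a few lines of standard-library Python. This is an illustrative reading, not the author's program: the function names are hypothetical, [(N + 1)alpha] is taken as the integer part, the "middle" block is centered on the sorted scores, percentiles use a simple nearest-rank rule, and the C ratio is median-centered with absolute values as described in the article's footnote on percentile ratios.

```python
import math
from statistics import fmean, median


def _block_size(n, alpha):
    # Assumption: [(N + 1) * alpha] means the integer part, with at least 1 case.
    return max(1, int((n + 1) * alpha))


def lower_upper_means(xs_sorted, alpha):
    """Return (L(alpha), U(alpha)): means of the lowest and highest blocks."""
    k = _block_size(len(xs_sorted), alpha)
    return fmean(xs_sorted[:k]), fmean(xs_sorted[-k:])


def middle_mean(xs_sorted, alpha):
    """Return M(alpha): mean of a block centered in the sorted scores."""
    k = _block_size(len(xs_sorted), alpha)
    start = (len(xs_sorted) - k) // 2
    return fmean(xs_sorted[start:start + k])


def hogg_statistics(scores):
    """Q and Q1 (tail weight) and upper-tail Q2 (asymmetry) as defined above."""
    xs = sorted(scores)
    l05, u05 = lower_upper_means(xs, 0.05)
    l20, u20 = lower_upper_means(xs, 0.20)
    l50, u50 = lower_upper_means(xs, 0.50)
    m25 = middle_mean(xs, 0.25)
    return {
        "Q": (u05 - l05) / (u50 - l50),
        "Q1": (u20 - l20) / (u50 - l50),
        "Q2": (u05 - m25) / (m25 - l05),
    }


def c_ratio(scores, upper_pct):
    """Median-centered ratio of an upper percentile point to the 75th."""
    xs = sorted(scores)
    med = median(xs)

    def percentile(p):
        # Nearest-rank percentile (an assumption; the source does not say which rule).
        rank = max(1, math.ceil(p / 100 * len(xs)))
        return xs[rank - 1]

    return abs(percentile(upper_pct) - med) / abs(percentile(75) - med)


# Hypothetical evenly spread scores 1..101, a stand-in for a uniform distribution.
uniform_like = list(range(1, 102))
print(hogg_statistics(uniform_like))
print(c_ratio(uniform_like, 90))
```

For a symmetric input such as this one, Q2 comes out at 1, and heavier tails push Q, Q1, and the C ratios upward. Scores with very few scale points can make the 75th percentile equal the median, in which case the C ratio is undefined.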
Distributions were placed in that category defined by their highest valued measure.

22 Two thousand replications of each classification statistic were computed to investigate sampling error for samples of size 500 and 1,000 for simulated Gaussian, moderate, extreme, and exponential contaminations (Table 1) using International Mathematical and Statistical Library subprograms GGUBS, GGNML, and GGEXN. Only slight differences occurred between sample sizes 500 and 1,000. Each statistic was at expectation for the Gaussian (50% above and 50% below cut). Results for asymmetric conditions indicate that cut points for moderate contamination underestimate nonnormality, with 70.4% (skewness), 81.2% (Q2), and 72.2% (M/M) of the simulated statistics falling below cut values at sample size 1,000. For extreme asymmetric contamination, simulated values closely fit expectations. However, for the exponential distribution, skewness cut points underestimate contamination (62% below cut), whereas those for Q2 and M/M overestimate contamination (35% and 43%, respectively, below cut) for sample size 1,000. Among tail weight measures, the most variable estimate (C97.5) showed considerable precision for the most extreme distribution (exponential), placing 45% of its simulated values below expected for sample size 1,000. This suggests that one might expect some misclassifications among distributions near the cut points for moderate and exponential asymmetry, with relative precision at other cut values.

1. Because score distributions did not have a mean of zero, in order to compute percentile ratios it was necessary to subtract the median from each of the relevant percentile points and use the absolute values of the ratios.

23 Figure 1 shows a light-tailed, moderately asymmetric distribution as categorized by the preceding criteria.

24 Multimodality and digit preferences also present identifiability problems for distributions other than the strict Gaussian. Therefore, arbitrary but conservative methods were used to define these forms of contamination. Two techniques, one objective and one subjective, were used to identify modality. First, histograms of all distributions were [page 159] reviewed, and those clearly exhibiting more than a single mode were classified as such. Second, during computer analysis, all sample points occurring with a frequency at least 80% of that of the true mode (up to a maximum of five) were identified, and the absolute distance between adjacent modes was computed. Distances greater than two thirds (.667) of a distribution's standard deviation were defined as bimodal. If more than one distance was this great, the distribution was defined as multimodal. In general, the two techniques coincided during application.

Figure 1: A light-tailed, moderately asymmetric distribution (n = 3,152).
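The objective modality rule in paragraph 24 is procedural enough to sketch in code. The version below is one possible reading using only the Python standard library: `classify_modality` is a hypothetical name, ties at the five-candidate limit are broken by frequency and then by score, and the population standard deviation is used because the article does not say which form was applied.

```python
from collections import Counter
from statistics import pstdev


def classify_modality(scores, ratio=0.80, max_modes=5, gap_in_sd=2 / 3):
    """Label a score distribution as unimodal, bimodal, or multimodal."""
    counts = Counter(scores)
    top = max(counts.values())  # frequency of the true mode

    # Candidate modes: score points at least 80% as frequent as the true mode,
    # keeping at most five, most frequent first.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    candidates = sorted(s for s in ranked[:max_modes] if counts[s] >= ratio * top)

    # Count gaps between adjacent candidates wider than two thirds of an SD.
    threshold = gap_in_sd * pstdev(scores)
    wide_gaps = sum(1 for a, b in zip(candidates, candidates[1:]) if b - a > threshold)

    if wide_gaps == 0:
        return "unimodal"
    if wide_gaps == 1:
        return "bimodal"
    return "multimodal"


# Hypothetical score vectors for illustration only.
one_peak = [3] * 2 + [4] * 6 + [5] * 10 + [6] * 6 + [7] * 2
two_peaks = [2] * 10 + [5] * 2 + [8] * 9
three_peaks = [1] * 10 + [5] * 10 + [9] * 10

for data in (one_peak, two_peaks, three_peaks):
    print(classify_modality(data))
```

Because the rule only looks at the few most frequent score points, it is cheap to run alongside the visual histogram review the article describes, which remains the check for cases the thresholds miss.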
25 Digits were defined as preferred if they occurred at least 20 times and if adjacent digits on both sides had fewer than 70% or greater than 130% as many cases. A digit preference value was computed by multiplying the number of digits showing preference by the inverse of the maximum percentage of preference for each distribution. A digit preference value exceeding 20 (at least four preferred digits with a maximum of 50% preference) was defined as lumpy. In addition, perceived lumpiness was identified. Figure 2 depicts a psychometric distribution that required a perceptual technique for classification as either lumpy or multimodal. This distribution consists of at least two and perhaps three fairly distinct subpopulations.
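The first step of the digit preference rule, flagging individual score points, might look like the sketch below. It is a literal reading under stated assumptions rather than the author's procedure: scores are integers, a point is flagged when it has at least 20 cases and each immediate neighbor's count falls outside the 70% to 130% band around its own count, and the lowest and highest observed scores are skipped because they have only one neighbor. The function name is hypothetical, and the sketch stops at flagging; it does not compute the digit preference value.

```python
from collections import Counter


def preferred_scores(scores, min_count=20, low=0.70, high=1.30):
    """Return score points whose neighbors on both sides differ sharply in frequency."""
    counts = Counter(scores)
    lowest, highest = min(counts), max(counts)
    flagged = []
    for s in range(lowest + 1, highest):  # interior score points only (assumption)
        c = counts.get(s, 0)
        if c < min_count:
            continue
        neighbors = (counts.get(s - 1, 0), counts.get(s + 1, 0))
        # Both neighbors must have fewer than 70% or more than 130% as many cases.
        if all(nb < low * c or nb > high * c for nb in neighbors):
            flagged.append(s)
    return flagged


# Hypothetical data: score 3 is chosen twice as often as its neighbors.
lumpy = [1] * 30 + [2] * 30 + [3] * 60 + [4] * 30 + [5] * 30
smooth = [1] * 30 + [2] * 32 + [3] * 33 + [4] * 31 + [5] * 30
print(preferred_scores(lumpy))
print(preferred_scores(smooth))
```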

Sample Results

26 Four hundred and forty distributions were submitted to analysis. Two hundred and sixty-five of these distributions came from journal articles or researches of various types, 30 from national tests, 64 from statewide tests, and 65 from districtwide tests. Seventeen distributions of college entrance and Graduate Record Examination (GRE) scores came from the University of South Florida's admission files.

Figure 2: An asymmetric, lumpy, multimodal distribution (n = 1,258).

27 [page 160] The 231 ability distributions were derived from 20 different test sources (e.g., Comprehensive Test of Basic Skills; CTBS) and 45 different populations. The 125 psychometric distributions included 20 types of measures responded to by 21 different populations. The 35 criterion measures were all part of the Florida State Assessment Program (teacher and student), two test sources responded to by 13 different populations. The 49 gain scores resulted from 5 test sources and 10 different populations.

28 Among ability measures, major sources included the California Achievement Tests, the Comprehensive Assessment Program, the CTBS, the Stanford Reading tests, tests produced by the Educational Testing Service for a beginning teacher study in California, the Scholastic Aptitude Tests, the College Board subject area aptitude tests, the American College Test, the GRE, a series of tests produced by Science Research Associates, several aptitude tests produced by Project Talent, the Henmon-Nelson IQ scores from the Wisconsin Longitudinal Study of High School Seniors, the Performance Assessment in Reading of McGraw-Hill, two scores produced by the International Association for the Evaluation of Educational Achievement Student Achievement Study of , and 15 tests representing districtwide, teacher made, textbook-produced, and composite scores created for specific studies.
29 Psychometric measures included: Minnesota Multiphasic Personality Inventory scales; interest inventories; measures of anger, anxiety, curiosity, sociability, masculinity/femininity, satisfaction, importance, usefulness, quality, and locus of control; and two measures difficult to categorize, the Mallory test of visual hallucinations and a measure of the degree to which one's partner exerts force to obtain sex.

30 Criterion/mastery test results for students in mathematics and communications skills at the 3rd, 5th, 8th, 10th, and 11th grades were obtained from the Florida State Assessment Program. For adults, Florida Teacher Certification Examination distributions were obtained for reading, writing, mathematics, and professional education.

31 Sample sizes for the distributions were (10.8%), (19.8%), 1,000-4,999 (55.1%), and 5,000 to 10,893 (14.3%). Approximately 90% of the distributions included 460 or more cases and almost 70% included 1,000 or more. Subject areas for achievement measures included language arts, quantitative arts/logic, sciences, social studies/history, and skills such as study skills, grammar, and punctuation. Grade/age groupings included 30.5% from grades K-6, 20% from grades 7-9, 18.4% from grades 10-12, 9% from college students, and 22% from adults.

32 Most distributions had sample spaces of between 10 and 99 scale points (83.3%). Fifty-five distributions (12.5%) had sample spaces of fewer than 10 scale points, and 19 distributions (4.3%) had sample spaces greater than 99 scale points.

Measures of Tail Weight and Asymmetry

33 On the basis of the criteria in Table 1, Table 2 shows that 67 (15.2%) of the 440 distributions had both tails with weights at or about the Gaussian, 216 (49.1%) had at least one extremely heavy tail, and 79 (18%) had both tail weights less than the Gaussian. Among ability measures, the percentages were similar with 45 (19.5%) having both tail weights at or about the Gaussian, 133 (57.6%) having at least one heavy tail, and 53 (22.9%) having both tails less than the Gaussian. Among psychometric measures, 17 (13.6%) had tail weights near the Gaussian, 82 (65.6%) had at least one moderately heavy tail, and 26 (20.8%) had both tail weights less than the Gaussian. All criterion/mastery and 45 (89.8%) of the gain score distributions exhibited at least one tail weight greater than that expected at the Gaussian. Five gain scores (10.2%) had tail weights near the Gaussian.

34 Table 3 shows that among all measures, 125 of the distributions were classified as being relatively symmetric (28.4%), and 135 (30.7%) were classified as being extremely asymmetric. Forty-seven percent of the gain score, 65.8% of the ability/achievement measures, 84.0% of psychometric measures, and 100% of criterion/mastery measures were at least moderately asymmetric.
Criterion/mastery and psychometric measures frequently exhibited extreme to exponential asymmetry, 94.3% and 52.0%, respectively. General ability measures tended to be less extreme (15.6% extremely or exponentially asymmetric).

35 Crossing the values for tail weight and symmetry, Table 4 shows that 30 (6.8%) of the 440 distributions exhibit both tail weight and symmetry approximating that expected at the Gaussian and that 21 (4.8%) exhibited relative symmetry and tail weights lighter than that expected at the Gaussian.

Table 2. Categories of Tail Weight Across Types of Measures, %
Columns: Level of symmetric contamination; Achievement (n = 231); Psychometric (n = 125); Criterion mastery (n = 35); Gain score (n = 49); All types (n = 440)
Rows: Uniform; Less than Gaussian; About Gaussian; Moderate; Extreme; Double exponential; Total
[Cell values not preserved.]

36 [page 161] Table 5 shows that results were similar for ability measures, with 23 (10.0%) at or about the Gaussian and 20 (8.7%) exhibiting relative symmetry and tail weights less than that expected at the Gaussian.

Table 3. Categories of Asymmetry Across Types of Measures, %
Columns: Level of asymmetric contamination; Achievement (n = 231); Psychometric (n = 125); Criterion mastery (n = 35); Gain score (n = 49); All types (n = 440)
Rows: Relatively symmetric; Moderate asymmetry; Extreme asymmetry; Exponential asymmetry; Total
[Cell values not preserved.]

37 Table 6 shows that 4 psychometric distributions (3.2%) exhibited both relative symmetry and tail weights near the Gaussian and 39 distributions (31.2%) exhibited extreme- to exponential-level tail weight combined with extreme- to exponential-level asymmetry.

38 Table 7 shows that criterion/mastery measures tended to exhibit at least moderate asymmetry (100%) and at least one tail weight at either the extreme or exponential level (91.4%). Twenty (57.2%) of these distributions exhibited asymmetry at or above the exponential.

39 Table 8 shows that gain scores were relatively symmetric to moderately asymmetric with moderate to heavy tail weights (81.6%). Four cases (8.2%) exhibited tail weight at or above the double exponential, and five (10.2%) were at or about the Gaussian. Two distributions (4.1%) exhibited asymmetry greater than the moderate level.

40 Although not used as a classification measure, kurtosis estimates were computed and ranged from to Ninety-seven percent (35/36) of those distributions exhibiting kurtosis beyond the double exponential (3.00) also showed extreme or exponential asymmetry and were frequently characterized by sample spaces of greater than 25 scale points. Almost all distributions having low (negative) kurtoses were at most moderately asymmetric and frequently had small sample spaces. The fourth-moment kurtosis estimate for these distributions correlated r = .78 with the third-moment skewness estimate.
Modality and Digit Preferences

41 Three hundred and twelve (70.9%) distributions were classified as unimodal, 89 (20.2%) as bimodal, and 39 (8.9%) as multimodal. Two hundred and eighteen distributions (49.5%) were defined as relatively smooth and 222 (50.5%) as lumpy. The smoothest distributions were criterion/mastery measures (89%) and gain scores (73%). Psychometric measures tended to be lumpy (61.6%), as did general ability measures (54.3%).

1. These are adjusted values at which the expected value at the Gaussian is 0.00 rather than

Testing for Normality

42 The Kolmogorov-Smirnov test of normality (SAS Institute, 1985) found 100% of the distributions to be significantly nonnormal at the .01 alpha level. However, 16 ability measures (6.9%) and 3 gain scores (6.1%) were found to be relatively symmetric, smooth, and unimodal and to have tail weights near those expected at the Gaussian. These 19 distributions (4.3%) may be considered quite reasonable approximations to the Gaussian. No psychometric measures and no criterion/mastery measures were included among these 19 distributions. Sample spaces ranged from 7 to 135 and sample sizes from 346 to 8,092.

Discussion

43 Although not drawn randomly, the 440 distributions coming from some 46 different test sources and 89 different populations should include most types of distributions occurring in applied settings for these measures. Since 60% of all distributions result directly from research and another 33% from state, district, or university scoring programs, they should also represent distributions directly relevant to research, theory development, and decision making.

44 Walberg et al. (1984), on the basis of an impressive literature review, conclude that asymmetry and extremes lying several standard deviations above the main distribution body occur commonly where measures are less restrictive in range than the typical achievement and attitude scale (p. 107). The current inquiry shows that even among the bounded measures of psychometry and achievement, extremes of asymmetry and lumpiness are more the rule than the exception. No distributions among those investigated passed all tests of normality, and very few seem to be even reasonably close approximations to the Gaussian. It therefore appears meaningless to test either ability or psychometric distributions for normality, because only weak tests or chance occurrences should return a conclusion of normality. Instead, one should probably heed Geary's (1947) caveat and pretend that "normality is a myth; there never was, and never will be, a normal distribution" (p. 241).
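A small simulation can illustrate why a large-sample Kolmogorov-Smirnov test rejects discrete, bounded scores almost by construction. The sketch below is not the SAS procedure cited in paragraph 42; it is a plain-Python K-S statistic against a normal curve fitted to the data, compared with the asymptotic .01 critical value of about 1.63 divided by the square root of n. That critical value assumes a fully specified null distribution, so using it with an estimated mean and standard deviation is, if anything, conservative. The scores are hypothetical: Gaussian values that are simply recorded as whole numbers.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def ks_statistic_vs_normal(scores):
    """Largest gap between the empirical cdf and a normal cdf fitted to the scores."""
    xs = sorted(scores)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # Compare just after and just before each jump of the empirical cdf.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical test scores: a Gaussian trait rounded to integer score points.
scores = [round(rng.gauss(50, 3)) for _ in range(5000)]

d_stat = ks_statistic_vs_normal(scores)
critical_01 = 1.63 / math.sqrt(len(scores))  # asymptotic alpha = .01 value
print(f"D = {d_stat:.4f}, critical value = {critical_01:.4f}")
```

Rounding alone creates steps in the empirical cdf that are larger than the critical value at this sample size, so the test rejects even though the underlying trait is exactly Gaussian. This is consistent with the article's point that a significant test says little about how far a score distribution departs from the normal in ways that matter.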
45 The implications of this for many commonly applied statistics are unclear because few robustness studies, either empirical or theoretical, have dealt with lumpiness or multimodality. These findings suggest the need for careful data scrutiny prior to analysis, for purposes of both selecting statistics and interpreting results. Adequate research is available to suggest that most parametric statistics should be fairly robust to both alpha and beta given light tail weights and moderate contaminations. For extreme to exponential asymmetry (52.0% of psychometric measures), one might expect at least the independent means t (given approximately equal ns) and F to exhibit robustness to alpha, if not beta. However, under such conditions, differences between medians may well be a more interesting research question than mean shift for studies seeking information about the middle rather than the tails of a distribution (Wilcox & Charlin, 1986).

46 Normalizing transformations are frequently applied to [page 162] suspected departures from symmetry. These, however, should be used with caution, because of problems such as selection and interpretability. For instance, as E. S. Pearson and Please (1975) note regarding log transformations, "There are also pitfalls in interpreting the analysis, if only because the antilog of the mean value of log x is not the mean of x" (p. 239). On this topic, see also Taylor (1985), Games (1984), Hill and Dixon (1982), Bickel and Doksum (1981), Carroll (1979), and Mosteller and Tukey (1978).
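The Pearson and Please caution about log transformations is easy to verify numerically. The sketch below uses three made-up values chosen so the arithmetic can be checked by hand: back-transforming the mean of the logs returns the geometric mean, which here is well below the arithmetic mean.

```python
import math
from statistics import fmean

# Hypothetical positively skewed observations.
x = [1.0, 10.0, 100.0]

arithmetic_mean = fmean(x)  # (1 + 10 + 100) / 3
mean_of_logs = fmean([math.log(v) for v in x])
back_transformed = math.exp(mean_of_logs)  # antilog of the mean of log x

print(arithmetic_mean)   # 37.0
print(back_transformed)  # close to 10, the geometric mean
```

A group comparison carried out on log scores therefore speaks to ratios of geometric means, not to differences between the raw-score means.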

Table 4. Tail Weight and Asymmetry for All Distributions
Columns: Values of tail weight; Values of asymmetry (Near symmetry, Moderate, Extreme, Exponential); Total (N, Percentage)
Rows: Uniform; Less than Gaussian; Near Gaussian; Moderate contamination; Extreme contamination; Double exponential; Total; Percentage
[Cell values not preserved.]

47 An attempt was made to characterize easily discernable [sic] groups of distributions. Patterns occurred consistently for two measures: (a) Gain scores tended to be fairly symmetric (either symmetric or moderately asymmetric) and to have moderate to heavy tails (85.7% of gain score distributions); (b) criterion/mastery tests tended to be extremely asymmetric (94.3%), with at least one heavy tail (91.4%). Fully 85.7% of the criterion/mastery distributions have at least one heavy tail combined with extreme asymmetry.

48 It proved impossible, however, to typify either general ability/achievement or psychometric measures, both of which tended to distribute throughout the symmetry/tail weight matrix (Tables 5 and 6), while exhibiting varying modalities and digit preferences. Psychometric measures exhibited greater asymmetry (84% were at least moderately asymmetric) and heavier tails (65.6% had at least one moderately heavy tail) than did ability measures.

49 Table 5 suggests that general ability measures tend to exhibit less extreme contamination than do the other measures. None had tail weights at or near the uniform, and only 3.0% exhibited asymmetry at or above that expected for the exponential. However, even if one treats all moderately contaminated cells of Table 5 as reasonable approximations to normality, only 132 general ability distributions (57.1%) would qualify for the title

50 Table 4 shows that most cells of the tail weight/asymmetry matrix are filled and that counts in each cell tend to remain fairly constant as one moves from light tails to heavy tails or from relative symmetry to extreme asymmetry. Table 4 also shows the poor match between real data and the smooth mathematical functions generally applied in Monte Carlo robustness studies.
Distributions exhibiting either extremely heavy tail weights (exponential) or extremely light tail weights (uniform) tend also to be asymmetric. This suggests that simulated studies based on such symmetric mathematical functions as the uniform, logistic, double exponential, Cauchy, and t with few degrees of freedom may not represent real-world data to any reasonable extent.

1. Recall that moderate contamination represents at least twice the expected cases more than 2 standard deviations from the mean and not more than 100 times the expected cases more than 3 standard deviations from the mean.
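The footnote's bounds can be read as ratios of observed to Gaussian-expected tail counts. The following is a minimal sketch of that reading only, assuming the plain mean and standard deviation as the footnote words it; it does not reproduce the study's own classification statistics, and the function names and the sample are hypothetical.

```python
import math
import statistics

def expected_tail_fraction(k):
    """Two-sided Gaussian probability of lying more than k SDs from the mean."""
    return math.erfc(k / math.sqrt(2.0))

def tail_ratio(data, k):
    """Observed fraction beyond k SDs of the mean, divided by the Gaussian expectation."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    observed = sum(abs(x - mean) > k * sd for x in data) / len(data)
    return observed / expected_tail_fraction(k)

# Hypothetical sample: most scores near the middle plus a few far-out cases.
sample = [50.0] * 40 + [45.0] * 25 + [55.0] * 25 + [20.0] * 5 + [85.0] * 5

ratio_2sd = tail_ratio(sample, 2)
ratio_3sd = tail_ratio(sample, 3)

# Bounds taken literally from the footnote: at least 2x expected beyond 2 SD,
# and not more than 100x expected beyond 3 SD.
within_footnote_bounds = ratio_2sd >= 2 and ratio_3sd <= 100

print(f"ratio beyond 2 SD: {ratio_2sd:.2f}")
print(f"ratio beyond 3 SD: {ratio_3sd:.2f}")
print(f"within the footnote's bounds: {within_footnote_bounds}")
```

Because the ordinary standard deviation is itself inflated by heavy tails, a ratio computed this way can understate contamination, which is one reason to treat the sketch as an illustration of the rule rather than a classifier.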

51 The distributions studied here exhibited almost every conceivable type of contamination, including (a) broad classes of tail weight (uniform to double exponential), (b) broad classes of symmetry (quite symmetric to asymmetry greater than that of the exponential), (c) varying modalities (unimodal, bimodal, multimodal), (d) varying types of lumpiness/digit preference, and (e) modes external to the mean/median interval. Also, all ratios of a robust scale estimate to the standard deviation were greater than the 1.00 expected at the normal. This indicates that all distributions exhibit at least some asymmetry (Messick, 1982; K. Pearson, 1895).

52 The great variety of shapes and forms suggests that respondent samples themselves consist of a variety of extremely heterogeneous subgroups, varying within populations on different yet similar traits that influence scores for specific measures. When this is considered in addition to the expected dependency inherent in such measures, it is somewhat unnerving to even dare think that the distributions studied here may not represent most of the distribution types to be found among the true populations of ability and psychometric measures.

53 One might expect treatment effects to create lumpiness, subgroupings, or bi/multimodalities such as those encountered in these data. Although a likely effect, it does not influence these results because the large sample requirement essentially eliminated postmeasures from experimental studies. In those situations in which both pre- and postmeasures were available, almost every case exhibiting lumpiness or bi/multimodality in the postmeasure showed similar characteristics in the premeasure. Figure 3 depicts an interesting and fairly common example of this with an intervening treatment. This premeasure, classified as either bimodal or lumpy, appears to include two [page 163] subgroups. One is familiar with the material, approaches the test ceiling, and ranges about [range missing]. The second is unfamiliar with the material and distributes around 4-7.
The unimodal nature of the postmeasure suggests that treatment (a 6-week general biology course) largely eliminated the latter group.

54 To assure that distributions were as homogeneous as possible, all distributions having identified subpopulations that were expected to differ on the measure (e.g., White/non-White, male/female) were separated, and generally only one was submitted to analysis. That distributions still exhibited substantial lumpiness and varying modalities calls to mind the argument Cournot proposed in 1843 that probability is irrelevant to statistics in the social sciences because an unlimited number of ways of classifying social data existed and any probability analysis that did not allow for the selection of categories after the collection of data was, in a practical sense, meaningless. (Stigler, 1986, p. 197)

Table 5. Tail Weight and Asymmetry for Ability Distributions [Columns, values of asymmetry: near symmetry, moderate, extreme, exponential, total N, percentage. Rows, values of tail weight: uniform, less than Gaussian, near Gaussian, moderate contamination, extreme contamination, double exponential, total, percentage. Cell values not preserved.]

55 The use of multiple classification measures produced some interesting findings. As with simulated exponential distributions, Q2 uniquely defined more real-world distributions as beyond the exponential (eight) than did either skewness (six) or M/M (four). Q statistics for tail weight (Q, Q1) rarely reached the highest classification value largely because of the prevalence of asymmetry. For C statistics, negative tails were more frequently defined as non-Gaussian than were positive ones. Also, contamination occurred more frequently in the closer tails (C10/C90) than in the farther tails (C025/C975). This suggests that contamination in the tails for these distributions is not evenly distributed, as one might expect for bounded, lumpy populations including undefined subgroups.

56 Some may contend that the use of finite samples does not disprove normality, because as sample size increases, score distributions are attracted to the normal. This type of confusion stems from the fallacious overgeneralization of central limit theorem properties from sample means to individual scores. The central limit theorem states that the sums (or means) of sufficiently large samples from a population satisfying the Lindeberg conditions will have an approximately normal distribution. It does not state, however, that the population of scores from which these sample means are drawn is normally distributed (Tapia & Thompson, 1978).

57 As was noted earlier, the implications these findings have for normality-assuming statistics are unclear. Prior robustness studies have generally limited themselves either to computational evaluation of asymptotic theory or to Monte Carlo investigations of interesting mathematical functions. This research [page 164] has been conducted almost exclusively using smooth mathematical functions that have rather extreme tail weights or asymmetry. Such characteristics proved rare among these real-world distributions.
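The distinction drawn in paragraph 56 between individual scores and sample means can be illustrated with a short simulation. This is a sketch under stated assumptions: the population is a hypothetical skewed, discrete, bounded set of scores generated for the example, and the sample size of 50 and the seed are arbitrary choices.

```python
import random
import statistics

def skewness(values):
    """Third-moment skewness: mean cubed deviation in SD units."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((x - mean) / sd) ** 3 for x in values)

rng = random.Random(1989)  # fixed seed so the sketch is repeatable

# Hypothetical skewed, discrete, bounded score population (not study data).
population = [min(int(rng.expovariate(0.5)), 20) for _ in range(20000)]

# Means of many samples of size 50 drawn with replacement from that population.
sample_means = [
    statistics.fmean(rng.choices(population, k=50)) for _ in range(2000)
]

pop_skew = skewness(population)
means_skew = skewness(sample_means)

print(f"skewness of individual scores: {pop_skew:.2f}")
print(f"skewness of sample means:      {means_skew:.2f}")
```

The sample means come out far closer to symmetric than the scores they were drawn from, while the score population itself stays just as skewed no matter how many cases are collected.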
Because 50% of these distributions exhibited lumpiness and about two thirds of ability and over four fifths of psychometric measures exhibited at least moderate asymmetry, these appear to be important areas for future study.

Table 6. Tail Weight and Asymmetry for Psychometric Distributions [Columns, values of asymmetry: near symmetry, moderate, extreme, exponential, total N, percentage. Rows, values of tail weight: uniform, less than Gaussian, near Gaussian, moderate contamination, extreme contamination, double exponential, total, percentage. Cell values not preserved.]

Interestingly, in Andrews et al. (1972, p. 109) there is a small section entitled "Asymmetric Situations" beginning with the caution, "Except in a few instances there may be no reason to believe the underlying distribution is symmetric." Andrews et al. (1972) investigated the performance of 65 location estimators in the presence of simulated normal populations having 10% asymmetric contamination 2 and 4 standard deviations from the population mean. For both situations at all sample sizes, the arithmetic mean proved

the least variable (best) estimator. These authors, who concluded that the arithmetic mean was the best choice as the worst estimator among those investigated, fail to mention this finding again because "about unsymmetric situations, ... we were not able to agree, either between or within individuals, as to the criteria to be used" (Andrews et al., 1972, p. 226). Thus, the arithmetic mean proved most robust (least variable) under asymmetry, the condition found to occur for most (71.6%) distributions investigated here.

Table 7. Tail Weight and Asymmetry for Criterion/Mastery Measures [Columns, values of asymmetry: near symmetry, moderate, extreme, exponential, total N, percentage. Rows, values of tail weight: uniform, less than Gaussian, near Gaussian, moderate contamination, extreme contamination, double exponential, total, percentage. Cell values not preserved.]

Table 8. Tail Weight and Asymmetry for Gain Scores [Columns, values of asymmetry: near symmetry, moderate, extreme, exponential, total N, percentage. Rows, values of tail weight: uniform, less than Gaussian, near Gaussian, moderate contamination, extreme contamination, double exponential, total, percentage. Cell values not preserved.]

Factors such as these suggest the need (a) to investigate the previous robustness research and determine its appropriateness given the types of contamination found to exist in the real world and (b) to suggest important areas for the investigation of the robustness of various statistics.

Figure 3: Pre- and postmeasures in 10th grade general biology (n = 337).

60 As an example of the first suggestion, the oft-cited works of Boneau (1960, 1962) and two prior studies dealing with small sample space situations are superficially considered. Boneau (1960, 1962) compared the robustness of the Mann-Whitney/Wilcoxon rank-sum test to that of the t test for samples of size (5, 5), (15, 15), and (5, 15) in the presence of two smooth symmetric distributions (uniform and normal) and one smooth asymmetric distribution (exponential). Setting aside the fact that all of his distributions were continuous and smooth, whereas half of these real-world data sets were lumpy and all were discrete, among the distributions studied here only 38 (8.6%) exhibited both exponential-level tail weight and asymmetry (largely criterion/mastery measures, n = 20), none exhibited symmetric, uniform (rectangular) tail weights, and only 19 (4.3%) can be considered even reasonable approximations to the Gaussian (normal). This does not invalidate his findings but does suggest that almost none of these comparisons occurs in real life. The most obvious differences between Boneau's data and that of the real world are lumpiness and discreteness. Two [page 165] prior studies deal with distributions exhibiting such characteristics in the limited area of small sample spaces. Hsu and Feldt (1969) found the F to exhibit robustness to alpha for populations with from 3 to 6 scale points (sample space). However, the maximum third-moment skewness included among their populations was .39, and in the current study, among the 43 distributions having sample spaces between 3 and 6, 72.7% exhibited either positive or negative skew greater

than .39. Thus, at least one important distributional characteristic suggests that the findings of Hsu and Feldt may not generalize to the real world of small sample spaces.

61 In a recent study by Gregoire and Driver (1987), the authors investigated several statistics in the presence of 12 varied populations having sample spaces of four or five. Among the 18 distributions in the current study having sample spaces of five or less, 7 (38.8%) exhibited skewness at or greater than .94. Only one population studied by Gregoire and Driver (1987) exhibited asymmetry at or about that level (0.99), and it proved to be one of the worst populations in their article. Specifically, for population IIC, the two-sample parametric confidence interval tended to be conservative to alpha (supporting almost all prior research using equal ns). The population mean was outside the .05 confidence interval about the sample mean 75% of the time for samples of size 25. The F test for homogeneity of variance was operating at an obtained alpha of about .21 when nominal alpha was .05. And finally, the KS two-sample test was extremely conservative, having an obtained alpha of about .01 when nominal alpha was .05. Unfortunately, this population was not included in their discussion of power. However, from their Table 6, it is interesting to note that the only comparison between two-sample tests in which a substantial power advantage accrues to any test is that between populations IIIA and IA (uniform). In that situation, the van der Waerden test exhibited a considerable power advantage at sample size 10 over both the parametric confidence interval and the Mann-Whitney/Wilcoxon tests. The current study suggests that this specific situation may never arise in practice, because none of the 440 distributions investigated here exhibited both relative symmetry and uniform level tail weights. However, 53 (22.9%) of ability/achievement distributions and 26 (20.3%) of psychometric distributions did have both tails lighter than the Gaussian.
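The skewness benchmarks cited for small sample spaces (.39 for Hsu and Feldt, .94 for Gregoire and Driver) refer to third-moment skewness, which is easy to compute from a frequency table. The sketch below assumes the usual population-moment definition; the function name and the 5-point rating counts are hypothetical, chosen only to show a distribution piled up near the scale ceiling.

```python
def skewness_from_frequencies(freq):
    """Third-moment skewness for a discrete distribution given {score: count}."""
    n = sum(freq.values())
    mean = sum(score * count for score, count in freq.items()) / n
    # Second and third central moments, weighted by the observed counts.
    m2 = sum(count * (score - mean) ** 2 for score, count in freq.items()) / n
    m3 = sum(count * (score - mean) ** 3 for score, count in freq.items()) / n
    return m3 / m2 ** 1.5

# Hypothetical 5-point scale with responses piled up at the high end.
ratings = {1: 10, 2: 20, 3: 40, 4: 130, 5: 200}
skew = skewness_from_frequencies(ratings)

print(f"third-moment skewness: {skew:.2f}")
print(f"exceeds .39 in magnitude: {abs(skew) > 0.39}")
print(f"exceeds .94 in magnitude: {abs(skew) > 0.94}")
```

A ceiling effect of this kind on a five-point scale already pushes skewness well past both benchmarks in magnitude, which is the sort of case the simulated populations in those studies rarely covered.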
62 Overall, one must conclude that the robustness literature is at best indicative, for at least two reasons: (a) Few prior studies deal with commonly occurring characteristics such as lumpiness and multimodalities, and (b) in some circles (e.g., Andrews et al., 1972), bias against the finding of robustness for parametric statistics may exist.

63 One disturbing finding of this research was a general lack of data availability. Only about 25% of the authors to whom requests were sent reported the ability to produce simple frequency distributions for data reported in their studies. Many different reasons for this inability were noted; however, no matter what the reasons, the situation is somewhat disquieting.

References

Allport, F. H. (1934). The J-curve hypothesis of conforming behavior. Journal of Social Psychology, 5.

Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H., & Tukey, J. W. (1972). Robust estimates of location: Survey and advances. Princeton, NJ: Princeton University Press.

Ansell, M. J. G. (1973). Robustness of location estimators to asymmetry. Applied Statistics, 22.

Bickel, P. J., & Doksum, K. A. (1981). An analysis of transformations revisited. Journal of the American Statistical Association, 76.

Blair, R. C. (1981). A reaction to "Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance." Review of Educational Research, 51.

Blischke, W. R. (1978). Mixtures of distributions. In W. H. Kruskal and J. M. Tanur (Eds.), International encyclopedia of statistics. New York: Free Press.

Boneau, C. A. (1960). The effects of violations of assumptions underlying the t test. Psychological Bulletin, 57.

Boneau, C. A. (1962). A comparison of the power of the U and t tests. Psychological Review, 69.

Bradley, J. V. (1977). A common situation conducive to bizarre distribution shapes. The American Statistician, 31.

Bradley, J. V. (1980). Nonrobustness in z, t, and F tests at large sample sizes. Bulletin of the Psychonomic Society, 16.

Bradley, J. V. (1982). The insidious L-shaped distribution. Bulletin of the Psychonomic Society, 20.

Carroll, R. J. (1979). On estimating variances of robust estimators when the errors are asymmetric. Journal of the American Statistical Association, 74.

David, H. A., & Shu, V. S. (1978). Robustness of location estimators in the presence of an outlier. In H. A. David (Ed.), Contributions to survey sampling and applied statistics. New York: Academic Press.

Elashoff, J. D., & Elashoff, R. M. (1978). Effects of errors in statistical assumptions. In W. H. Kruskal and J. M. Tanur (Eds.), International encyclopedia of statistics. New York: Free Press.

Galton, F. (1889). Natural inheritance. London: Macmillan.

Games, P. A. (1984). Data transformations, power, and skew: A rebuttal to Levine and Dunlap. Psychological Bulletin, 95, [sic]

Gastwirth, J. L. (1971). On the sign test for symmetry. Journal of the American Statistical Association, 166.

Gastwirth, J. L., & Rubin, H. (1975). The behavior of robust estimators on dependent data. The Annals of Statistics, 3.

Geary, R. C. (1947). Testing for normality. Biometrika, 34.

Gregoire, T. G., & Driver, B. L. (1987). Analysis of ordinal data to detect population differences. Psychological Bulletin, 101.

Hampel, F. R. (1973). Robust estimation: A condensed partial survey. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 27.

Hastings, N. A. J., & Peacock, J. B. (1975). Statistical distributions: A handbook for students and practitioners. New York: Wiley.

Hettmansperger, T. P., & McKean, J. W. (1978). Statistical inference based on ranks. Psychometrika, 43.

Hill, M., & Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38.

Hogg, R. V. (1974). Adaptive robust procedures: A partial review and some suggestions for future applications and theory. American Statistical Association Journal, 69.

Hopkins, K. D., & Glass, G. V. (1978). Basic statistics for the behavioral sciences. Englewood Cliffs, NJ: Prentice-Hall.

Hsu, T., & Feldt, L. S. (1969). The effect of limitations on the number of criterion score values on the significance level of the F test. American Educational Research Journal, 6.

Ito, P. K. (1980). Robustness of ANOVA and MANOVA test procedures. In P. R. Krishnaiah (Ed.), Handbook of statistics (Vol. 6). Amsterdam: North-Holland.

Kempthorne, O. (1978). Some aspects of statistics, sampling and randomization. In H. A. David (Ed.), Contributions to survey sampling and applied statistics. New York: Academic Press.

Kowalski, C. J. (1972). On the effects of non-normality on the distribution of the sample product-moment correlation coefficient. Applied Statistics, 21.

Law, A. M., & Vincent, S. O. (1983). UNIFIT: An interactive computer package for fitting probability distributions to observed data. Tucson, AZ: Simulation Modeling and Analysis Company.

Messick, D. M. (1982). Some cheap tricks for making inferences about distribution shapes from variances. Educational and Psychological Measurement, 42.

Mosteller, F., & Tukey, J. W. (1978). Data analysis and regression: A second course in statistics. Boston: Addison-Wesley.

Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.

Pearson, E. S., & Please, N. W. (1975). Relation between the shape of population distribution and the robustness of four simple test statistics. Biometrika, 62.

Pearson, K. (1895). Contributions to the mathematical theory of evolution: II. Skew variation in homogeneous material. Philosophical Transactions of the Royal Society Ser. A, 186.

Quandt, R. E., & Ramsey, J. B. (1978). Estimating mixtures of normal distributions and switching regressions. American Statistical Association Journal, 73.

SAS Institute. (1985). SAS user's guide: Basics. Cary, NC: Author.

Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42.

Stigler, S. M. (1977). Do robust estimators work with real data? The Annals of Statistics, 5.

Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: Belknap Press.

Student. (1908). The probable error of a mean. Biometrika, 6.

Taillie, C., Patil, G. P., & Baldessari, B. A. (1981). Statistical distributions in scientific work: Vol. 5. Inferential problems and properties. Boston: D. Reidel.

Tan, W. Y. (1982). Sampling distributions and robustness of t, F and variance-ratio in two samples and ANOVA models with respect to departure from normality. Communications in Statistics, A11.

Tapia, R. A., & Thompson, J. R. (1978). Nonparametric probability density estimation. Baltimore, MD: Johns Hopkins University Press.

Taylor, J. M. G. (1985). Measures of location of skew distributions obtained through Box-Cox transformations. Journal of the American Statistical Association, 80.

Tukey, J. W., & McLaughlin, D. H. (1963). Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization. Indian Journal of Statistics, 25.

Wainer, H., & Thissen, D. (1976). Three steps toward robust regression. Psychometrika, 41.

Walberg, H. J., Strykowski, B. F., Rovai, E., & Hung, S. S. (1984). Exceptional performance. Review of Educational Research, 54.

Wegman, E. J., & Carroll, R. J. (1977). A Monte Carlo study of robust estimators of location. Communications in Statistics, A6, 795-812.

Wilcox, R. R., & Charlin, V. L. (1986). Comparing medians: A Monte Carlo study. Journal of Educational Statistics, 11.

Wilson, E. B., & Hilferty, M. M. (1929). Note on C. S. Peirce's experimental discussion of the law of errors. Proceedings of the National Academy of Science, 15.

Received September 14, 1987
Revision received November 30, 1987
Accepted March 22,