LSAC RESEARCH REPORT SERIES

The Probability of Exceedance as a Nonparametric Person-Fit Statistic for Tests of Moderate Length

Jorge N. Tendeiro
Rob R. Meijer
University of Groningen, Groningen, the Netherlands

Law School Admission Council
Research Report 13-06
November 2013

A Publication of the Law School Admission Council
The Law School Admission Council (LSAC) is a nonprofit corporation that provides unique, state-of-the-art products and services to ease the admission process for law schools and their applicants worldwide. Currently, 218 law schools in the United States, Canada, and Australia are members of the Council and benefit from LSAC's services. All law schools approved by the American Bar Association are LSAC members. Canadian law schools recognized by a provincial or territorial law society or government agency are also members. Accredited law schools outside of the United States and Canada are eligible for membership at the discretion of the LSAC Board of Trustees; Melbourne Law School, the University of Melbourne is the first LSAC-member law school outside of North America. Many nonmember schools also take advantage of LSAC's services. For all users, LSAC strives to provide the highest quality of products, services, and customer service.

Founded in 1947, the Council is best known for administering the Law School Admission Test (LSAT), with about 100,000 tests administered annually at testing centers worldwide. LSAC also processes academic credentials for an average of 60,000 law school applicants annually, provides essential software and information for admission offices and applicants, conducts educational conferences for law school professionals and prelaw advisors, sponsors and publishes research, funds diversity and other outreach grant programs, and publishes LSAT preparation books and law school guides, among many other services. LSAC electronic applications account for 98 percent of all applications to ABA-approved law schools.

© 2013 by Law School Admission Council, Inc. All rights reserved. No part of this work, including information, data, or other portions of the work published in electronic form, may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage and retrieval system, without permission of the publisher. For information, write: Communications, Law School Admission Council, 662 Penn Street, PO Box 40, Newtown, PA, 18940-0040.

This study is published and distributed by LSAC.
Table of Contents

Executive Summary
Introduction
A Nonparametric Approach to Person Fit
The Probability of Exceedance
Simulation Study
    Data Simulation: Normal Response Patterns
    Data Simulation: Aberrant Response Patterns
    Results: Model Violations
    Results: Detection Rate of the PE
Asymptotic Distribution of the PE Statistic
Discussion
References
Appendix A
Appendix B
Executive Summary

In this report we present a measure to identify unlikely patterns of correct/incorrect answers to test questions (commonly referred to as items). Some examples of why such patterns may occur include the misinterpretation of questions, item preknowledge, answer copying, or guessing behavior. The proposed measure is the probability of exceedance (PE). PE provides information about the probability of a correct/incorrect answer pattern, conditional on the test taker's total score. Although this concept is not new, it is hardly if ever applied in practice. In this report we show how the PE of a response vector can be computed and how misfitting response patterns are detected. A simulation study is conducted to investigate the robustness of this procedure.

Introduction

In psychological and educational testing, checking the validity of individual test scores is an important element in the assessment procedure. One way to do this is to investigate the consistency of observed item scores with the probability expected under an item response theory (IRT; Embretson & Reise, 2000) model. Analyzing how well individual response vectors fit an IRT model is known as person-fit research or appropriateness measurement (e.g., Meijer & Sijtsma, 2001). The importance of person-fit research is, for example, recognized in the guidelines of the International Test Commission (2011), which recommended checking unexpected response patterns as one of many tools to determine test score validity (p. 7). Also, several testing organizations have considered implementing methods for individual test score and item score validity. For example, at Educational Testing Service, the TOEFL program has already implemented quality control charts for its quarterly reviews as preemptive checks (A. von Davier, personal communication, July 2, 2012). Another example of the usefulness of checking the consistency of individual item scores was given in Meijer, Egberink, Emons, and Sijtsma (2008), who identified schoolchildren who did not understand the phrasing of many questions of a questionnaire on self-concept.
Furthermore, Ferrando (2012) discussed the use of person-fit research to screen for idiosyncratic answering behavior and low person-reliability for students filling out a Neuroticism and Extraversion personality scale. For overviews on some of the statistics and procedures available in person-fit research, see Karabatsos (2003) and Meijer and Sijtsma (1995, 2001).

Although there are many person-fit statistics available in the literature, most statistics have been proposed in the context of parametric IRT. In many test applications, however, it can be very convenient to have person-fit statistics that do not require the estimation of parametric IRT parameters, but instead only require nonparametric IRT item indices such as the item proportion-correct scores. For example, for almost all tests and questionnaires that are evaluated in the Dutch Rating System for Test Quality (Evers, Sijtsma, Lucassen, & Meijer, 2010) no information is available with respect to parametric IRT parameters, and this also applies to tests in other countries (Geisinger, 2012).
Meijer and Sijtsma (2001) discussed early attempts to formulate nonparametric IRT statistics (see also Emons, 2008; Meijer, 1994). When applying these statistics, however, there is a lack of research to help one decide when to classify an item score pattern as either fitting or misfitting. To classify a pattern as (mis)fitting, a distribution is needed. This distribution can be based on empirical results, exact results, asymptotic results, or simulation. For parametric IRT, there are studies that discuss these distributions (e.g., Drasgow, Levine, & McLaughlin, 1991; Magis, Raîche, & Béland, 2012; Meijer & Tendeiro, 2012; Snijders, 2001). For nonparametric approaches, however, there are almost no studies that discuss when to classify an item score pattern as misfitting. Although one can always use some rule of thumb such as, for example, 1.5 standard deviations from the mean score (see, e.g., Tukey, 1977), in many situations information about the probability of the realization of a particular score pattern would help researchers to decide when to classify a pattern as misfitting.

In this report we discuss an approach to classify a score pattern as misfitting (aberrant) without assuming a parametric model: the probability of exceedance (PE). We show that for tests of moderate length, the exact distribution of the PE can be used to classify a score pattern as normal or aberrant. The procedures to compute the PE for individual response vectors and the associated exact distributions are explained and are partly based on earlier work by van der Flier (1982). Furthermore, we show that violations of the invariant item ordering property (IIO; Meijer & Egberink, 2012; Sijtsma & Junker, 1996; Sijtsma, Meijer, & van der Ark, 2011) do not seem to have a large impact on the performance of the PE statistic. Finally, the extension of the PE statistic to long tests is discussed.

A Nonparametric Approach to Person Fit

In the present study, nonparametric IRT models (NIRT; Mokken & Lewis, 1982; Sijtsma & Molenaar, 2002) are used.
The main goal in NIRT is to use the scores of a group of persons on a test to rank the persons on an assumed latent trait $\theta$ (and not to estimate each person's $\theta$-score, as is done in parametric IRT). Let random variable $X_i$ denote the score on item $i$ ($i = 1, \ldots, k$, where $k$ is the number of items in the test or questionnaire). All items considered in this report are dichotomous; hence item scores are equal to either 0 or 1 (to code incorrect or correct answers, respectively). The item response function (IRF) is defined as

$$P_i(\theta) = P(X_i = 1 \mid \theta), \tag{1}$$

where $i = 1, \ldots, k$. In NIRT the IRFs are defined as distribution-free functions of the latent ability. The following three assumptions in NIRT are typically used (e.g., Sijtsma & Molenaar, 2002, p. 8):
1. Unidimensionality: All items in the test are designed to predominantly measure the same latent trait.
2. Local independence: Answers to different items are statistically independent conditional on $\theta$; therefore $P(X_1 = x_1, \ldots, X_k = x_k \mid \theta) = \prod_{i=1}^{k} P(X_i = x_i \mid \theta)$.
3. Latent monotonicity of the IRFs: Each IRF is monotone nondecreasing in $\theta$; that is, $P_i(\theta_a) \le P_i(\theta_b)$ for all $\theta_a \le \theta_b$ and $i = 1, \ldots, k$.

It can be observed that these assumptions also apply to the most common parametric IRT models (e.g., the logistic models; see Embretson & Reise, 2000). In this sense, one can regard parametric IRT models as nonparametric IRT models with added constraints (namely, in the functional relationship between $\theta$ and the probability of answering the item correctly). The NIRT model that satisfies assumptions 1–3 is known as the Monotone Homogeneity Model (MHM). When the items are dichotomous, the MHM implies a stochastic ordering of the latent trait (SOL) by means of the total score statistic $X_+ = \sum_{i=1}^{k} X_i$ (Hemker, Sijtsma, Molenaar, & Junker, 1997):

$$P(\theta > c \mid X_+ = s) \le P(\theta > c \mid X_+ = t), \tag{2}$$

for any fixed value $c$ and for any total scores $0 \le s < t \le k$. SOL is an important property because it justifies the common use of the total score $X_+$ to infer the ordering of the persons on the unobservable latent scale $\theta$.

IIO is a condition that eases the interpretation of person-fit results, and it is used to interpret PE. The items of a test satisfy the IIO assumption if they can be numbered such that

$$P_1(\theta) \ge P_2(\theta) \ge \cdots \ge P_k(\theta) \quad \text{for all } \theta. \tag{3}$$

In other words, IIO means that the IRFs do not intersect (Sijtsma & Molenaar, 2002, p. 2). A model verifying IIO allows for an ordering of the items that is independent of $\theta$. In practical terms, this means that the relative difficulty of the items is the same across the entire latent scale. IIO may seem to be a strong assumption that is difficult to satisfy in practice for some datasets (e.g., Ligtvoet, van der Ark, te Marvelde, & Sijtsma, 2010; Sijtsma & Junker, 1996; Sijtsma et al., 2011). However, Meijer and Egberink (2012) and Meijer, Tendeiro, and Wanders (2013) found that, for clinical scales, IIO was satisfied in many cases once low discriminating items were removed from the data. Besides, Sijtsma and
Meijer (2001) showed that the performance of a number of person-fit statistics was robust against violations of IIO. We assume IIO when we discuss PE. Practical consequences of violating IIO with respect to the PE statistic will be addressed later in the report.

The Probability of Exceedance

Define the proportion-correct score ($p$-value) of item $i$ by

$$p_i = \int P_i(\theta) f(\theta)\, d\theta,$$

where $f(\theta)$ is the density of ability in the population. Recall that the IIO assumption allows ordering the items conditional on $\theta$; see Equation (3). The probability that random vector $\mathbf{X} = (X_1, \ldots, X_k)$ will be equal to a specific response vector $\mathbf{x} = (x_1, x_2, \ldots, x_k)$ can then be defined by

$$P(\mathbf{X} = \mathbf{x}) = \prod_{i=1}^{k} p_i^{x_i} (1 - p_i)^{1 - x_i}. \tag{4}$$

Pattern $\mathbf{x}$ deviates from the expected score pattern if too many easy items (i.e., items with large $p_i$) are answered incorrectly, and/or if too many difficult items (i.e., items with low $p_i$) are answered correctly. In this sense, $\mathbf{x}$ is considered deviant or aberrant when it does not closely match the expected score pattern that is suggested by the population's $p$-values.

The probability of exceedance (PE) of an item score pattern $\mathbf{x}$, $PE(\mathbf{x})$, is a measure of the deviance between $\mathbf{x}$ and the expected score pattern. $PE(\mathbf{x})$ is defined as the sum of the probabilities of all response vectors which are, at most, as likely as $\mathbf{x}$, given the total score $X_+$:

$$PE(\mathbf{x}) = \sum_{\mathbf{y}} P(\mathbf{X} = \mathbf{y} \mid X_+), \tag{5}$$

where the summation extends to all response vectors $\mathbf{y}$ with total score $X_+$ verifying $P(\mathbf{X} = \mathbf{y}) \le P(\mathbf{X} = \mathbf{x})$. Equation (5) may be equivalently written as

$$PE(\mathbf{x}) = \frac{\sum_{\mathbf{y}:\, y_+ = X_+,\ P(\mathbf{y}) \le P(\mathbf{x})} P(\mathbf{X} = \mathbf{y})}{\sum_{\mathbf{y}:\, y_+ = X_+} P(\mathbf{X} = \mathbf{y})}. \tag{6}$$

We observe that because the computation of PE relies on item response vectors that were not necessarily observed in the data (specifically, response vectors less likely than $\mathbf{x}$), the PE person-fit statistic does not follow the likelihood principle (Birnbaum, 1962; Lee, 2004).
The PE statistic requires assumptions 1–3 to hold; that is, the MHM should fit the data adequately. Conditioning the probabilities on the right-hand side of Equation (5) on the total score is based on the stochastic ordering of the subjects on the $\theta$ scale. Therefore, the PE of a score pattern $\mathbf{x}$ accumulates evidence against $\mathbf{x}$ based on a subpopulation of persons with the same latent trait; see Equation (2). Response pattern $\mathbf{x}$ is considered deviant or aberrant with respect to the population's expected score if $PE(\mathbf{x})$ is smaller than a predefined level (e.g., .05 or .10, or some predefined percentile of the exact distribution of PE).

The IIO assumption is useful for a correct interpretation of the PE because of the unique ordering of the items by their $p$-values across $\theta$. Violations of IIO may lead to situations where the relative difficulty of the items changes for different values of $\theta$, that is, where Equation (3) is violated. Such a violation puts the adequacy of the ordering of the $p$-values across the latent ability scale at risk. Assuming IIO avoids this kind of ambiguity when interpreting item scores. Therefore, it is generally useful to use a model meeting the IIO assumption when the main goal is to compare item score patterns between persons with different total scores, as is the case in person-fit analyses. Thus, the additional assumption of IIO states that the overall item ordering according to the $p$-values is the same for different trait values.

The concept of PE is illustrated in Table 1 using a short test with $k = 5$ items. All item score patterns with total score $X_+ = 3$ are enumerated: There are 10 different patterns. Items are ordered by increasing order of difficulty ($p$-values: $p_1 = .8$, $p_2 = .7$, $p_3 = .5$, $p_4 = .4$, and $p_5 = .3$), although such an ordering is not strictly necessary; see Equations (4) and (5). Column $P(\mathbf{x})$ displays the probability of each response score pattern computed using Equation (4). For example, the probability of response pattern 5 is equal to $(1 - p_1)\, p_2\, p_3\, p_4\, (1 - p_5) = .2 \times .7 \times .5 \times .4 \times .7 = .0196$. Observe that the 10 response patterns are ordered by increasing order of $P(\mathbf{x})$.
This implies that response patterns are ordered by decreasing order of misfit from the expected score pattern. Column $PE(\mathbf{x})$ quantifies how much each response pattern deviates from the population's expected response pattern based on the $p$-values; see Equation (5). For example, the PE of response pattern 3 is equal to (.0036 + .0084 + .0126)/.3602 = .0683, where .3602 is the sum of the probabilities of the 10 response patterns with a total score of 3. It can be seen that the least deviant response vector is the Guttman pattern (1, 1, 1, 0, 0), whereas the most deviant response vector is the reversed Guttman pattern (0, 0, 1, 1, 1) (Guttman, 1944, 1950). Hence, PE can be regarded as a person-fit statistic that is sensitive to deviations from the Guttman model.
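The enumeration behind the five-item example can be sketched in a few lines of Python. This is only an illustrative sketch, not the R code of Appendix A: the function names `pattern_probability` and `probability_of_exceedance` and the small tolerance `tol` (used to guard the "at most as likely" comparison against floating-point noise) are assumptions introduced here.

```python
from itertools import combinations
from math import prod


def pattern_probability(x, p):
    """Equation (4): multiply p_i for correct and (1 - p_i) for incorrect answers."""
    return prod(pi if xi == 1 else 1.0 - pi for xi, pi in zip(x, p))


def probability_of_exceedance(x, p, tol=1e-12):
    """Equations (5)-(6): PE of pattern x by complete enumeration.

    All patterns y with the same total score as x are generated; the
    probabilities of those at most as likely as x are summed and divided
    by the probability of all patterns with that total score.
    """
    k, total = len(p), sum(x)
    px = pattern_probability(x, p)
    tail = 0.0  # patterns at most as likely as x
    norm = 0.0  # all patterns with the same total score
    for ones in combinations(range(k), total):
        y = [1 if i in ones else 0 for i in range(k)]
        py = pattern_probability(y, p)
        norm += py
        if py <= px * (1.0 + tol):
            tail += py
    return tail / norm


# p-values of the five-item example
p_values = [0.8, 0.7, 0.5, 0.4, 0.3]
print(round(probability_of_exceedance([0, 1, 1, 0, 1], p_values), 4))  # 0.0683
```

The printed value reproduces the worked example for response pattern 3; the Guttman pattern (1, 1, 1, 0, 0) yields a PE of 1 under the same function.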
TABLE 1
Response vector probabilities and probabilities of exceedance of all response vectors with three correct answers in a test with five items

Score      Item 1     Item 2     Item 3     Item 4     Item 5
pattern x  (p1 = .8)  (p2 = .7)  (p3 = .5)  (p4 = .4)  (p5 = .3)  P(x)    PE(x)
 1         0          0          1          1          1          .0036   .0100
 2         0          1          0          1          1          .0084   .0333
 3         0          1          1          0          1          .0126   .0683
 4         1          0          0          1          1          .0144   .1083
 5         0          1          1          1          0          .0196   .1627
 6         1          0          1          0          1          .0216   .2227
 7         1          0          1          1          0          .0336   .3159
 8         1          1          0          0          1          .0504   .4559
 9         1          1          0          1          0          .0784   .6735
10         1          1          1          0          0          .1176  1.0000

Note. pi = p-value of item i, i = 1, ..., 5; P(x) = probability of response score pattern x; PE(x) = probability of exceedance of response score pattern x.

This example can be extended to larger tests. The complete enumeration of response score patterns for given values of $k$ and $X_+$ becomes computationally more intensive for larger numbers of items. The computation of the exact distribution of PE is feasible for tests consisting of up to $k = 20$ items using currently available personal computers. Table 2 gives an idea of the number of possible response patterns for given values of $k$ and $X_+$. Appendix A shows the R (R Development Core Team, 2011) code that can be used to perform all the necessary computations.

TABLE 2
Number of possible response patterns given the total number of items k and the total correct score X+

k     X+    Number of patterns
 5     3    10
10     5    252
15     8    6,435
20    10    184,756
25    12    5,200,300
30    15    155,117,520
50    25    126,410,606,437,752

We observe that the estimation of the cutoff level for the PE statistic is dependent on the type of data to be analyzed. Factors such as the number of items in the scale, the item $p$-values, and the total sum score play a role in determining the exact distribution of the PE statistic. Several approaches can be conceived to determine adequate cutoff values, such as using predefined cutoff values (1%, 5%, or 10%), using percentiles of the empirical distribution, using percentiles of the exact distribution, or estimating cutoff values using bootstrapping procedures. The researcher should decide, in each situation, which approach provides the most sensible results in terms of false/true-positive rates.
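The growth in the number of patterns reported in Table 2 follows from the fact that the number of dichotomous response vectors of length k with total score X+ is the binomial coefficient "k choose X+". A minimal Python sketch, with the helper name `n_patterns` introduced here purely for illustration:

```python
from math import comb


def n_patterns(k, total):
    """Number of 0/1 response vectors of length k with exactly `total` ones."""
    return comb(k, total)


# (k, X+) pairs listed in Table 2
for k, total in [(5, 3), (10, 5), (15, 8), (20, 10), (25, 12), (30, 15), (50, 25)]:
    print(k, total, f"{n_patterns(k, total):,}")
```

The counts make clear why complete enumeration is practical around 20 items but quickly becomes infeasible beyond that.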
Simulation Study

We investigated the performance of the PE statistic in a simulation study. The goal was to gain a clearer impression concerning PE's detection rates and robustness against violations of the NIRT model assumptions, under two different settings. In particular, we were interested in studying the robustness of PE against violations of IIO using simulated data. As discussed before, IIO is a useful property in the framework of person fit because it allows a unique ordering of the items across the latent scale. A violation of Equation (3) may lead to intervals on the $\theta$ scale where, say, item $i$ is easier than item $j$, and intervals where the reverse is also possible. Such situations should be avoided whenever possible. In this section we present the results of a simulation study that investigated how much the PE statistic is affected by violations of IIO (and by other necessary model assumptions, to be presented shortly).

Several procedures from the mokken R package (van der Ark, 2007, 2012) were used to check the fit of the MHM to the data as well as violations of IIO. We followed general guidelines given by Sijtsma et al. (2011) and Meijer and Tendeiro (2012) to perform our analyses. In particular, Meijer and Tendeiro (2012) indicated that before investigating person fit it is important to first check whether the IRT model fits the data; if not, misfit is difficult to interpret. Hence, special attention is paid to model fitting prior to person-fit assessment.

Data Simulation: Normal Response Patterns

Twenty different datasets with scores of 1,000 persons on 15 items were generated using the one-parameter logistic model with item discrimination equal to 1.7 for every item; item-difficulty and person-ability parameters were drawn from the standard normal distribution. The number and percentages of test takers displaying aberrant behavior equaled N.aberr = 10, 50, and 100, corresponding to 1%, 5%, and 10% of the test takers, respectively, and the number and percentages of items answered aberrantly equaled k.aberr = 3, 5, and 10, corresponding to 20%, 33%, and 66% of the items, respectively.
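The data-generating step can be sketched as follows. This is a hedged illustration rather than the code used in the study: it assumes the usual logistic form P(X = 1 | θ) = 1 / (1 + exp(−a(θ − b))) with a common discrimination a, and the function name `simulate_1pl`, its arguments, and the seed handling are choices made here.

```python
import math
import random


def simulate_1pl(n_persons, n_items, discrimination=1.7, seed=0):
    """Simulate dichotomous scores under a one-parameter logistic model.

    Abilities (theta) and difficulties (b) are drawn from the standard
    normal distribution; every item shares the same discrimination.
    Returns (theta, b, data), where data[v][i] is the 0/1 score of
    person v on item i.
    """
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    data = []
    for t in theta:
        row = []
        for bi in b:
            prob = 1.0 / (1.0 + math.exp(-discrimination * (t - bi)))
            row.append(1 if rng.random() < prob else 0)
        data.append(row)
    return theta, b, data


theta, b, data = simulate_1pl(n_persons=1000, n_items=15, seed=1)
# Sample p-values (proportion correct per item), the only item indices PE needs
p_values = [sum(row[i] for row in data) / len(data) for i in range(15)]
```

The sample p-values computed in the last line are the only item information the PE statistic uses.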
Test takers were randomly selected to display aberrant behavior according to two criteria: The latent ability should be low (more precisely, $\theta < -1$), and the total sum score on the (k.aberr + 3) most difficult items of the scale should not exceed 3. Two types of aberrant behavior were mimicked in our simulation: cheating and random responding. In the case of cheating, k.aberr 0s out of the (k.aberr + 3) most difficult items were randomly selected and changed into 1s. In the case of random responding, each of the k.aberr 0s out of the (k.aberr + 3) most difficult items was changed into a 1 with a probability of .25. Moreover, 20 replications were simulated for each condition. This framework served as the basis for our simulation study.

We started the analysis by confirming that some necessary conditions for the MHM to hold were met for the 20 datasets generated (prior to imputing aberrant behavior). All inter-item covariances were non-negative (Sijtsma & Molenaar, 2002, Theorem 4.1), and all item-pair scalability coefficients $H_{ij}$ satisfied $0 < H_{ij} < 1$ (Sijtsma & Molenaar, 2002, Theorem 4.3). Moreover, no violations of monotonicity were found. The Automated Item Selection Procedure (AISP; Mokken, 1971; Sijtsma & Molenaar, 2002) was used to select items that comply with the MHM (i.e., such that all inter-item
covariances are positive and all item scalability coefficients $H_i$ are larger than a specified lower bound $c = .3$). All 15 items were selected by the AISP, thus assuring the scalability of the generated set of items. The overall scalability coefficients of the 20 datasets varied between $H = .42$ and $H = .50$, and hence the scale can be considered moderate in precision with respect to its ability to order persons on the latent scale by means of the total scores (Mokken, 1971). Also, we used several methods available in the mokken package to look for violations of IIO: the matrix method (Molenaar & Sijtsma, 2000), the rest score method (Molenaar & Sijtsma, 2000), and the manifest IIO method (MIIO; Ligtvoet et al., 2010). No significant violations of IIO were found. The $H^T$ coefficients of the 20 datasets varied between .32 and .64 (mean = .48, SD = .08), and thus the precision of the item ordering was, on average, medium (Ligtvoet et al., 2010).

Data Simulation: Aberrant Response Patterns

Next, aberrant behavior (cheating, random responding) was imputed in each 15-item dataset following the procedure previously explained. The probability of exceedance was then computed for each of the 1,000 test takers. A cutoff level of .10 was used as the criterion to flag item score vectors: Vectors $\mathbf{x}$ verifying $PE(\mathbf{x}) < .10$ were flagged as potentially displaying aberrant behavior. This cutoff level kept empirical Type I error rates between 1% and 3% in case of cheating (Table 3) and between 2% and 3% in case of random responding (Table 4).
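The two imputation schemes described for the selected test takers can be sketched as below. The sketch makes assumptions that the report does not spell out: item difficulty is ranked by the sample p-values, and the helper names `hardest_items`, `impute_cheating`, and `impute_random_responding` are hypothetical. It also relies on the selection criterion (a score of at most 3 on the k.aberr + 3 most difficult items), which guarantees that at least k.aberr zeros are available.

```python
import random


def hardest_items(p_values, n):
    """Indices of the n items with the lowest p-values (assumed difficulty ranking)."""
    return sorted(range(len(p_values)), key=lambda i: p_values[i])[:n]


def impute_cheating(x, p_values, k_aberr, rng):
    """Change k_aberr randomly chosen 0s among the (k_aberr + 3) hardest items into 1s."""
    pool = hardest_items(p_values, k_aberr + 3)
    zeros = [i for i in pool if x[i] == 0]
    y = list(x)
    for i in rng.sample(zeros, k_aberr):
        y[i] = 1
    return y


def impute_random_responding(x, p_values, k_aberr, rng, p_correct=0.25):
    """Change each of k_aberr chosen 0s into a 1 with probability p_correct."""
    pool = hardest_items(p_values, k_aberr + 3)
    zeros = [i for i in pool if x[i] == 0]
    y = list(x)
    for i in rng.sample(zeros, k_aberr):
        if rng.random() < p_correct:
            y[i] = 1
    return y


rng = random.Random(1)
p_values = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # hypothetical 8-item scale
x = [1, 1, 1, 0, 1, 0, 0, 0]                          # hypothetical low scorer
print(impute_cheating(x, p_values, k_aberr=3, rng=rng))
```

After imputation, each modified vector would be passed to a PE routine and flagged when its PE falls below the chosen cutoff.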
TABLE 3
Empirical Type I error rates (mean, SD) in the cheating behavior case, across 20 replications, using PE = .10 as threshold

                    Total number of items
N.aberr  k.aberr    15           16           17           18           19           20
 10       3         .02 (.006)   .03 (.005)   .03 (.006)   .02 (.005)   .03 (.007)   .03 (.006)
          5         .02 (.006)   .03 (.005)   .03 (.006)   .02 (.005)   .02 (.007)   .03 (.005)
         10         .02 (.006)   .03 (.005)   .03 (.005)   .02 (.005)   .02 (.007)   .03 (.006)
 50       3         .02 (.006)   .02 (.004)   .02 (.005)   .02 (.005)   .02 (.007)   .02 (.005)
          5         .02 (.006)   .02 (.005)   .02 (.005)   .02 (.005)   .02 (.007)   .02 (.006)
         10         .02 (.007)   .02 (.006)   .02 (.006)   .02 (.006)   .02 (.008)   .02 (.006)
100       3         .02 (.006)   .02 (.005)   .02 (.006)   .02 (.005)   .02 (.007)   .02 (.006)
          5         .01 (.006)   .02 (.005)   .01 (.005)   .02 (.005)   .02 (.006)   .02 (.006)
         10         .02 (.007)   .02 (.007)   .02 (.008)   .02 (.008)   .02 (.009)   .02 (.008)

Note. N.aberr = number of simulated test takers displaying cheating behavior; k.aberr = number of items whose scores have been changed in order to display cheating behavior.
TABLE 4
Empirical Type I error rates (mean, SD) in the random responding behavior case, across 20 replications, using PE = .10 as threshold

                    Total number of items
N.aberr  k.aberr    15           16           17           18           19           20
 10       3         .02 (.006)   .03 (.005)   .03 (.006)   .03 (.005)   .03 (.007)   .03 (.006)
          5         .02 (.007)   .03 (.005)   .03 (.006)   .03 (.005)   .03 (.007)   .03 (.006)
         10         .02 (.006)   .03 (.005)   .03 (.006)   .03 (.005)   .03 (.007)   .03 (.006)
 50       3         .02 (.006)   .02 (.005)   .03 (.006)   .02 (.005)   .02 (.007)   .03 (.005)
          5         .02 (.006)   .02 (.005)   .02 (.006)   .02 (.005)   .02 (.006)   .03 (.005)
         10         .02 (.006)   .03 (.005)   .03 (.005)   .03 (.005)   .03 (.007)   .03 (.005)
100       3         .02 (.006)   .02 (.005)   .02 (.006)   .02 (.005)   .02 (.007)   .02 (.005)
          5         .02 (.006)   .02 (.005)   .02 (.006)   .02 (.005)   .02 (.007)   .02 (.006)
         10         .02 (.007)   .02 (.006)   .02 (.006)   .02 (.005)   .02 (.007)   .02 (.005)

Note. N.aberr = number of simulated test takers displaying random responding behavior; k.aberr = number of items whose scores have been changed in order to display random responding behavior.

Scores on five additional items were generated using a similar procedure as before except for the discrimination parameter, which was now fixed at .2 for these items. We combined the scores on the original 15 items with the scores on these extra 1, 2, ..., 5 items. The total number of items in the scale (variable length) was also used as a factor to explain the findings in the simulation study. It was expected that the different discrimination values used to generate the scores on the additional set of items would lead to an increasing number of significant violations of IIO (Sijtsma & Molenaar, 2002).

Results: Model Violations

The effects of N.aberr, k.aberr, and length on the mean number of significant violations to IIO were analyzed using full factorial models. Results are summarized in the top panels of Tables 5 and 6. Factor length had indeed the largest effect on the number of significant violations to IIO in the random responding situation. Similar results were found in the cheating situation with the exception of the matrix criterion.
We also inspected whether other NIRT model assumptions were affected by our data manipulation (concerning the imputation of aberrant behavior and the addition of one through five extra items to the data). We checked whether inter-item covariances were non-negative (Sijtsma & Molenaar, 2002, Theorem 4.1), item-pair scalability coefficients $H_{ij}$ were between 0 and 1 (Sijtsma & Molenaar, 2002, Theorem 4.3), and IRFs displayed monotonicity. Full factorial models (main effects: N.aberr, k.aberr, and length) were fitted. It was verified that all model assumptions were significantly affected [non-negative inter-item covariances: $F(9, 1070) = 66.42$, $p < .01$, $R^2 = .36$; $H_{ij}$ coefficients between 0 and 1: $F(9, 1070) = 150.3$, $p < .01$, $R^2 = .56$; monotonicity: $F(9, 1070) = 91.29$, $p < .01$, $R^2 = .43$]. The bottom panels of Tables 5 and 6 show the effect sizes for each factor. It can be verified that N.aberr (the proportion of subjects in the sample displaying aberrant behavior) was the factor with the largest effect on the violations of the NIRT model conditions.
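The two descriptive checks on item pairs (non-negative covariances and scalability coefficients between 0 and 1) can be sketched from sample moments. This is not the mokken package's implementation and involves no significance testing; it assumes the standard definition for dichotomous items, in which H_ij is the observed covariance divided by the maximum covariance attainable given the two p-values, min(p_i, p_j) − p_i p_j. The function name `item_pair_checks` is introduced here for illustration.

```python
def item_pair_checks(data):
    """Sample covariance and scalability coefficient H_ij for every item pair.

    data[v][i] is the 0/1 score of person v on item i. For dichotomous
    items the maximum covariance given the marginals is
    min(p_i, p_j) - p_i * p_j, so H_ij = cov_ij / cov_max_ij.
    Returns a dict mapping (i, j) to (covariance, H_ij).
    """
    n, k = len(data), len(data[0])
    p = [sum(row[i] for row in data) / n for i in range(k)]
    results = {}
    for i in range(k):
        for j in range(i + 1, k):
            p11 = sum(row[i] * row[j] for row in data) / n  # both correct
            cov = p11 - p[i] * p[j]
            cov_max = min(p[i], p[j]) - p[i] * p[j]
            h = cov / cov_max if cov_max > 0 else float("nan")
            results[(i, j)] = (cov, h)
    return results


# Hypothetical toy data: a perfect Guttman pair gives H_ij = 1
toy = [[1, 1], [1, 0], [0, 0], [1, 1]]
print(item_pair_checks(toy)[(0, 1)])
```

Pairs with a negative covariance, or with H_ij outside the interval from 0 to 1, would be the ones to inspect before any person-fit analysis.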
TABLE 5
Effect of N.aberr, k.aberr, and length on the deterioration of NIRT model assumptions in the cheating setting

             N.aberr      k.aberr      length       Global effect
Top panel
matrix       ω² = .36     ω² = .19     ω² = .05     R² = .62
restscore    ω² = .21     ω² = .18     ω² = .23     R² = .62
MIIO         ω² = .19     ω² = .17     ω² = .27     R² = .63
Bottom panel
Cov          ω² = .32     ω² = .01     ω² = .02     R² = .36
Monot        ω² = .41     ω² = .02     ω² < .01     R² = .43
Hij          ω² = .52     ω² = .04     ω² < .01     R² = .56

Note. N.aberr = number of simulated test takers displaying cheating behavior; k.aberr = number of items whose scores have been changed in order to display cheating behavior; length = number of items in the dataset; global effect = fit of the regression model using main effects only. Top panel: Mean number of significant violations to IIO according to the matrix, restscore, and MIIO procedures. Bottom panel: Cov = proportion of the replications in which all inter-item covariances are non-negative; Monot = mean number of significant violations to monotonicity; Hij = mean number of significant violations to item-pair scalability coefficients verifying 0 < Hij < 1.

TABLE 6
Effect of N.aberr, k.aberr, and length on the deterioration of NIRT model assumptions in the random responding setting

             N.aberr      k.aberr      length       Global effect
Top panel
matrix       ω² = .01     ω² < .01     ω² = .05     R² = .07
restscore    ω² < .01     ω² = .01     ω² = .41     R² = .43
MIIO         ω² = .01     ω² = .01     ω² = .44     R² = .47
Bottom panel
Cov          ω² = .10     ω² = .01     ω² = .09     R² = .20
Monot        ω² = .20     ω² = .02     ω² < .01     R² = .24
Hij          ω² = .17     ω² = .02     ω² = .01     R² = .21

Note. N.aberr = number of simulated test takers displaying random responding behavior; k.aberr = number of items whose scores have been changed in order to display random responding behavior; length = number of items in the dataset; global effect = fit of the regression model using main effects only. Top panel: Mean number of significant violations to IIO according to the matrix, restscore, and MIIO procedures. Bottom panel: Cov = proportion of the replications in which all inter-item covariances are non-negative; Monot = mean number of significant violations to monotonicity; Hij = mean number of significant violations to item-pair scalability coefficients verifying 0 < Hij < 1.
Summarizing, we concluded that adding one through five differently discriminating items to the initial set of 15 items did lead to a significant increase in the number of violations to IIO. This effect was more evident in the random responding setting than in the cheating setting. Moreover, it was verified that other NIRT model assumptions (nonnegative inter-item covariances, $H_{ij}$ coefficients between 0 and 1, and latent monotonicity) were mostly affected by the number of subjects in the sample that displayed aberrant behavior. We cannot stress enough how important it is to carefully check, and report, model assumptions before attempting any kind of person-fit analyses.
Results: Detection Rate of the PE

Our next step was to analyze the detection rate of the PE under both aberrant behaviors considered in this study. Figures 1 and 2 (for cheating and random responding, respectively) display the detection rates found using a PE threshold value of .10; Table 7 summarizes the size of the effects found. The number of items displaying aberrant behavior (k.aberr) had the largest effect on the detection rates in both settings. Interestingly, the detection rates were higher for a moderate value of k.aberr (= 5); both low and large values of k.aberr are associated with lower power. This finding is in line with results in St-Onge, Valois, Abdous, and Germain (2011), who showed that the detection rates of several person-fit statistics increase with aberrance rates only to some point, after which a decrease is to be expected.

It was also observed that the cheating detection rates decreased with N.aberr. This can be understood by observing that the cheating behavior that we imputed led to sum-score differences (before versus after cheating imputation), which had a large impact on the original sum-scores (which were typically very low). In other words, the performance of PE seemed to decrease when the aberrance rate increased beyond moderate boundaries (St-Onge et al., 2011). The imputation of random responding behavior, on the other hand, was milder (the selected 0s were changed into 1s with a probability of .25). This introduced a more moderate rate of aberrant behavior in the data, and the PE statistic performed accordingly (for k.aberr = 3, 10): Its detection rate improved with N.aberr. The k.aberr = 5 condition in the random responding case was different because the PE's detection rate decreased with N.aberr.

FIGURE 1. Cheating detection rate of PE for a number of items equal to 15 (top left), 16 (top middle), 17 (top right), 18 (bottom left), 19 (bottom middle), and 20 (bottom right).
FIGURE 2. Random responding detection rate of PE for a number of items equal to 15 (top left), 16 (top middle), 17 (top right), 18 (bottom left), 19 (bottom middle), and 20 (bottom right).

TABLE 7
Effect of N.aberr, k.aberr, and length on detection rates for cheating and random responding

                     N.aberr      k.aberr      length       Global effect
Cheating             ω² = .09     ω² = .22     ω² = .03     R² = .34
Random responding    ω² = .00     ω² = .26     ω² = .01     R² = .27

Note. N.aberr = number of simulated test takers displaying aberrant behavior; k.aberr = number of items whose scores have been changed in order to display aberrant behavior; length = number of items in the dataset; global effect = fit of the regression model using main effects only.

Once more, the explanation resides in the balance that must exist between the performance of a person-fit statistic (PE in this study) and the level of aberrant rate in the data. When k.aberr = 5 the actual detection rate is larger than for the other values considered, but adding more and more aberrant test takers to the set did surpass some breakpoint of the PE statistic, which affected its performance for higher rates of aberrant test takers in the data.

In general, it can be concluded that the PE performed very well in the cheating case and moderately well in the random responding case. The PE statistic did not seem to be overly affected by violations of IIO or other model assumptions. Several factors, such as the number of test takers and items displaying aberrant behavior, must be taken into account when judging the performance of PE. Also, we stress two important ideas that should be taken into account when attempting to perform any person-fit analysis (using PE or any other statistic). We find it important to check whether the item response model of choice fits the data adequately (as we did) and to check how the performance of the person-fit statistic is affected by the several factors that play a role in fit measurement.
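The two quantities reported in the simulation, the detection rate (proportion of aberrant test takers flagged) and the empirical Type I error rate (proportion of normal test takers flagged), can be computed from PE values and known aberrance labels as in the following sketch. The function name `flag_rates` and the example numbers are hypothetical; the cutoff is left as an explicit argument because the appropriate value depends on the data at hand.

```python
def flag_rates(pe_values, is_aberrant, cutoff):
    """Detection rate and empirical Type I error rate for a given PE cutoff.

    A response vector is flagged when its PE is strictly below the cutoff.
    Returns (detection_rate, type_one_rate); a rate is NaN when the
    corresponding group is empty.
    """
    flagged = [pe < cutoff for pe in pe_values]
    n_aberrant = sum(1 for a in is_aberrant if a)
    n_normal = len(is_aberrant) - n_aberrant
    hits = sum(1 for f, a in zip(flagged, is_aberrant) if f and a)
    false_alarms = sum(1 for f, a in zip(flagged, is_aberrant) if f and not a)
    detection = hits / n_aberrant if n_aberrant else float("nan")
    type_one = false_alarms / n_normal if n_normal else float("nan")
    return detection, type_one


# Hypothetical PE values for six test takers, three of them aberrant
pe = [0.02, 0.50, 0.08, 0.30, 0.90, 0.05]
labels = [True, False, True, True, False, False]
print(flag_rates(pe, labels, cutoff=0.10))
```

In real data the labels are of course unknown, which is why these rates can only be studied through simulation.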
Asymptotic Distribution of the PE Statistic

One limitation of the PE statistic is that its computation requires a complete enumeration of all response patterns with the same length and total-correct score as the response pattern under inspection. This task becomes demanding for numbers of items larger than, say, 20 on an average personal computer. Table 2 illustrates how quickly the total number of response vectors increases as the number of items increases. Depending on the number of items, it might be possible to circumvent the problem by using supercomputers. Nevertheless, it would be useful to approximate the exact distribution of the PE statistic through an asymptotic distribution for long tests.

In Appendix B we show the statistical derivation (based on previous work by van der Flier, 1982) that we used in an attempt to approximate the exact distribution of PE for large tests. We confirmed that this approximate distribution worked well only for a very limited range of situations (i.e., when all the $p$-values are very close to each other) for tests consisting of 20 items. Hence, it is still not clear how many items are required in order for the approximate distribution to be useful for long tests. Future research that can clarify this issue, or that possibly presents different distributional alternatives, is needed.

Discussion

In this report, we discussed a nonparametric statistic to detect misfitting item score patterns that is based on complete enumeration of all possible item score patterns. A big advantage of this method compared to existing methods is that practitioners can use the PE with a prespecified probability level. A drawback is that it can only be used for tests of moderate length due to the rapid increase of computational labor as the number of items increases.

It is important to observe that the procedure used here does not guarantee that aberrant behavior did indeed take place whenever a flagging occurs. The PE, as is usually the case for interpreting person-fit statistics, can only provide an indication of the presence of aberrant behavior. The PE should not be used as conclusive evidence that aberrant behavior did occur.
Some follow-up strategies (e.g., interviewing the flagged test takers, interviewing the proctors, consulting the seating charts) could provide more substantive information.

In practice some items may not fit the IRT model. A researcher is then in the vexing position of having to choose between removing items because of inferior psychometric quality and keeping items in the scale because longer tests are better suited to detect person misfit (Meijer, Sijtsma, & Molenaar, 1995). There are good arguments, however, in favor of first investigating the scale quality of a set of items before conducting person-fit research. Inspecting the psychometric quality of the items and removing items with insufficient quality reduces the error component when we try to interpret misfitting response behavior. When an item cannot be described by an IRT model (e.g., because it correlates negatively with other items), or when an item has low discrimination (i.e., low H value), its score is a very unreliable indicator of the latent variable. Taking these item scores into account to assess person fit increases the error component, and thus hinders the (psychological) interpretation. Note that the PE only relates the items to their proportion-correct scores and thus does not account for the discrimination of an item. Thus, it is important to check for items with low discriminating power, as we did.

Finally, in this study we discussed the PE for dichotomous items, which are often encountered in educational and intelligence testing. However, this procedure can be generalized to polytomous item scores, which will be a topic for future research.

References

Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57(298), 269–306. doi:10.1080/01621459.1962.10480660

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15(2), 171–191. doi:10.1177/014662169101500207

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates Publishers.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32(3), 224–247. doi:10.1177/0146621607302479

Evers, A., Sijtsma, K., Lucassen, W., & Meijer, R. R. (2010). The Dutch review process for evaluating the quality of psychological tests: History, procedure, and results. International Journal of Testing, 10(4), 295–317. doi:10.1080/15305058.2010.518325

Ferrando, P. J. (2012). Assessing inconsistent responding in E and N measures: An application of person-fit analysis in personality. Personality and Individual Differences, 52(6), 718–722. doi:10.1016/j.paid.2011.12.036

Geisinger, K. F. (2012). Worldwide test reviewing at the beginning of the twenty-first century. International Journal of Testing, 12(2), 103–107. doi:10.1080/15305058.2011.651545

Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 9, 139–150. doi:10.2307/2086306

Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer et al. (Eds.), Measurement and precision (pp. 60–90). Princeton, NJ: Princeton University Press.

Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62(3), 331–347. doi:10.1007/bf02294555
International Test Commission (2011). ITC guidelines for quality control in scoring, test analysis, and reporting of test scores. Retrieved from http://intestcom.org.

Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16(4), 277–298. doi:10.1207/s15324818ame1604_2

Lee, P. M. (2004). Bayesian statistics: An introduction. West Sussex, UK: John Wiley & Sons Ltd.

Ligtvoet, R., van der Ark, L. A., te Marvelde, J. M., & Sijtsma, K. (2010). Investigating an invariant item ordering for polytomously scored items. Educational and Psychological Measurement, 70(4), 578–595. doi:10.1177/0013164409355697

Magis, D., Raîche, G., & Béland, S. (2012). A didactic presentation of Snijders's lz* index of person fit with emphasis on response model selection and ability estimation. Journal of Educational and Behavioral Statistics, 37(1), 57–81. doi:10.3102/1076998610396894

Meijer, R. R. (1994). The number of Guttman errors as a simple and powerful person-fit statistic. Applied Psychological Measurement, 18(4), 311–314. doi:10.1177/014662169401800402

Meijer, R. R., & Egberink, I. J. L. (2012). Investigating invariant item ordering in personality and clinical scales: Some empirical findings and a discussion. Educational and Psychological Measurement, 72(4), 589–607. doi:10.1177/0013164411429344

Meijer, R. R., Egberink, I. J. L., Emons, W. H. M., & Sijtsma, K. (2008). Detection and validation of unscalable item score patterns using item response theory: An illustration with Harter's self-perception profile for children. Journal of Personality Assessment, 90(3), 227–238. doi:10.1080/00223890701884921

Meijer, R. R., & Sijtsma, K. (1995). Detection of aberrant item score patterns: A review of recent developments. Applied Measurement in Education, 8(3), 261–272. doi:10.1207/s15324818ame0803_5

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25(2), 107–135. doi:10.1177/01466210122031957

Meijer, R. R., Sijtsma, K., & Molenaar, I. W. (1995). Reliability estimation for single dichotomous items based on Mokken's IRT model. Applied Psychological Measurement, 19(4), 323–335. doi:10.1177/014662169501900402

Meijer, R. R., & Tendeiro, J. N. (2012). The use of the lz and lz* person-fit statistics and problems derived from model misspecification. Journal of Educational and Behavioral Statistics, 37(6), 758–766. doi:10.3102/1076998612466144
Meijer, R. R., Tendeiro, J. N., & Wanders, R. B. K. (in press). The use of nonparametric IRT to explore data quality. Handbook of Item Response Theory Methods as Applied to Patient Reported Outcomes.

Mokken, R. J. (1971). A theory and procedure of scale analysis. Berlin, Germany: De Gruyter.

Mokken, R. J., & Lewis, C. (1982). A nonparametric approach to the analysis of dichotomous item responses. Applied Psychological Measurement, 6(4), 417–430. doi:10.1177/014662168200600404

Molenaar, I. W., & Sijtsma, K. (2000). User's manual MSP5 for Windows. Groningen: IEC ProGAMMA.

R Development Core Team (2011). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.r-project.org/.

Sijtsma, K., & Junker, B. W. (1996). A survey of theory and methods of invariant item ordering. British Journal of Mathematical and Statistical Psychology, 49(1), 79–105. doi:10.1111/j.2044-8317.1996.tb01076.x

Sijtsma, K., & Meijer, R. R. (2001). The person response function as a tool in person-fit research. Psychometrika, 66(2), 191–207. doi:10.1007/bf02294835

Sijtsma, K., Meijer, R. R., & van der Ark, L. A. (2011). Mokken scale analysis as time goes by: An update for scaling practitioners. Personality and Individual Differences, 50(1), 31–37. doi:10.1016/j.paid.2010.08.016

Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Thousand Oaks, CA: SAGE Publications, Inc.

Snijders, T. A. B. (2001). Asymptotic null distribution of person fit statistics with estimated person parameter. Psychometrika, 66(3), 331–342. doi:10.1007/bf02294437

St-Onge, C., Valois, P., Abdous, B., & Germain, S. (2011). Accuracy of person-fit statistics: A Monte Carlo study of the influence of aberrance rates. Applied Psychological Measurement, 35(6), 419–432. doi:10.1177/0146621610391777

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley Publishing Company.

van der Ark, L. A. (2007). Mokken scale analysis in R. Journal of Statistical Software, 20(11), 1–19.

van der Ark, L. A. (2012). New developments in Mokken scale analysis in R. Journal of Statistical Software, 48(5), 1–27.

van der Flier, H. (1982). Deviant response patterns and comparability of test scores. Journal of Cross-Cultural Psychology, 13(3), 267–298. doi:10.1177/0022002182013003001
Appendix A: R Code for Computing PE

Function uniqueperm2 (retrieved from http://stackoverflow.com/questions/5671149/permute-all-unique-enumerations-of-a-vector-in-r) generates all unique permutations of a dichotomous vector x of size n. Function PE computes the PE for each person (= row) of the dataset Data.

    # All unique permutations of x, returned as the rows of a matrix
    # (or x itself when all of its elements are equal).
    uniqueperm2 <- function(x) {
      dat <- factor(x)
      N <- length(dat)
      n <- tabulate(dat)
      ng <- length(n)
      if (ng == 1) return(x)
      a <- N - c(0, cumsum(n))[-(ng + 1)]
      foo <- lapply(1:ng, function(i) matrix(combn(a[i], n[i]), nrow = n[i]))
      out <- matrix(NA, nrow = N, ncol = prod(sapply(foo, ncol)))
      xxx <- c(0, cumsum(sapply(foo, nrow)))
      xxx <- cbind(xxx[-length(xxx)] + 1, xxx[-1])
      miss <- matrix(1:N, ncol = 1)
      for (i in seq_len(length(foo) - 1)) {
        l1 <- foo[[i]]
        nn <- ncol(miss)
        miss <- matrix(rep(miss, ncol(l1)), nrow = nrow(miss))
        k <- (rep(0:(ncol(miss) - 1), each = nrow(l1))) * nrow(miss) +
          l1[, rep(1:ncol(l1), each = nn)]
        out[xxx[i, 1]:xxx[i, 2], ] <- matrix(miss[k], ncol = ncol(miss))
        miss <- matrix(miss[-k], ncol = ncol(miss))
      }
      k <- length(foo)
      out[xxx[k, 1]:xxx[k, 2], ] <- miss
      out <- out[rank(as.numeric(dat), ties.method = "first"), ]
      foo <- cbind(as.vector(out), as.vector(col(out)))
      out[foo] <- x
      t(out)
    }

    # PE for every person (row) of the 0/1 data matrix Data.
    PE <- function(Data) {
      Nsubs <- dim(Data)[1]
      Nitems <- dim(Data)[2]
      # Complete enumeration is only feasible for tests of moderate length.
      if (Nitems > 20) stop("Number of items is > 20. Abort.")
      Ps <- as.vector(apply(Data, 2, sum) / Nsubs)  # item proportion-correct scores
      Qs <- 1 - Ps
      possible.NC <- as.numeric(levels(factor(apply(Data, 1, sum))))
      PEvec.Data <- rep(NA, Nsubs)
      for (tms in 1:length(possible.NC)) {
        NC <- possible.NC[tms]
        # All response vectors with NC correct answers (one per row).
        Data.NC <- uniqueperm2(c(rep(1, NC), rep(0, Nitems - NC)))
        Nvecs.NC <- if (length(Data.NC) == Nitems) {1} else {dim(Data.NC)[1]}
        # Probability of each enumerated response vector.
        Pvec <- NULL
        if (Nvecs.NC > 1) {
          for (i in 1:Nvecs.NC) {
            Pvec <- c(Pvec, prod(Ps^Data.NC[i, ]) * prod(Qs^(1 - Data.NC[i, ])))
          }
        }
        # PE of each enumerated vector: conditional probability of all vectors
        # that are at most as likely as the vector itself.
        PEvec <- NULL
        if (Nvecs.NC > 1) {
          for (i in 1:Nvecs.NC) {
            PEvec <- c(PEvec, sum(Pvec[Pvec <= Pvec[i]]) / sum(Pvec))
          }
        } else {
          PEvec <- c(1)
        }
        # Match each observed vector with total score NC to its enumerated twin.
        suitable.subs <- which(apply(Data, 1, sum) == NC)
        if (Nvecs.NC > 1) {
          for (i in 1:length(suitable.subs)) {
            sub <- 1
            while (sum(abs(Data.NC[sub, ] - Data[suitable.subs[i], ])) > 0 &
                   sub <= Nvecs.NC) {
              sub <- sub + 1
            }
            PEvec.Data[suitable.subs[i]] <- PEvec[sub]
          }
        } else {
          for (i in 1:length(suitable.subs)) {
            PEvec.Data[suitable.subs[i]] <- 1
          }
        }
      }
      PEvec.Data
    }
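As a compact illustration of the enumeration that the PE relies on, the following sketch computes the PE of a single response vector. It is not part of the original appendix: the function name pe_single and its arguments x (a 0/1 response vector) and p (item proportion-correct scores, supplied by the user rather than estimated from a dataset) are hypothetical, and the small numerical tolerance used for ties is an added assumption.

```r
# Hypothetical helper (an illustrative sketch, not the report's code):
# exact PE of one dichotomous response vector x given item p-values p.
pe_single <- function(x, p) {
  k <- length(x)
  r <- sum(x)
  # Only one response vector exists for total scores 0 and k.
  if (r == 0 || r == k) return(1)
  # Each column of `ones` holds the positions of the 1s in one vector
  # with total score r; choose(k, r) columns in total.
  ones <- combn(k, r)
  # Probability of each enumerated vector under independent item scores.
  probs <- apply(ones, 2, function(idx) prod(p[idx]) * prod(1 - p[-idx]))
  # Probability of the observed vector.
  px <- prod(p[x == 1]) * prod(1 - p[x == 0])
  # Conditional probability of all vectors at most as likely as x
  # (the tolerance guards against rounding in tied probabilities).
  sum(probs[probs <= px * (1 + 1e-12)]) / sum(probs)
}

p <- c(0.9, 0.7, 0.5, 0.3)
pe_single(c(1, 1, 0, 0), p)  # Guttman pattern, the most likely one: returns 1
pe_single(c(0, 0, 1, 1), p)  # reversed pattern: a small PE value
```

Because combn(k, r) enumerates choose(k, r) vectors, the cost of this sketch grows as quickly with test length as that of the functions above.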
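Appendix B: An Attempt to Derive an Asymptotic Distribution of the PE Statistic

Consider the log-likelihood of response vector $\mathbf{x} = (x_1, x_2, \ldots, x_k)$ with total score $X_+$, where $p_i$ denotes the p-value (proportion-correct score) of item $i$:

$$L_{\mathbf{x}} = \log\bigl(P(\mathbf{X} = \mathbf{x})\bigr) = \sum_{i=1}^{k} x_i \log\frac{p_i}{1 - p_i} + \sum_{i=1}^{k} \log(1 - p_i) = V_{\mathbf{x}} + C. \tag{B-1}$$

The log-likelihood of $\mathbf{x}$ is a sum of two terms: a random variable denoted $V_{\mathbf{x}}$ and a constant term $C$ that does not depend on $\mathbf{x}$. If the number of items is sufficiently large ($k \geq 20$) and if the set of p-values displays a reasonable variance (van der Flier, 1982, p. 295), then it can be shown that $V_{\mathbf{x}}$ (conditional on $X_+$) is asymptotically normally distributed with mean

$$\mu = \sum_{i=1}^{k} p_i \log\frac{p_i}{1 - p_i} + \frac{\sum_{i=1}^{k} p_i (1 - p_i) \log\frac{p_i}{1 - p_i}}{\sum_{i=1}^{k} p_i (1 - p_i)} \Bigl( X_+ - \sum_{i=1}^{k} p_i \Bigr) \tag{B-2}$$

and variance

$$\sigma^2 = \sum_{i=1}^{k} p_i (1 - p_i) \Bigl( \log\frac{p_i}{1 - p_i} \Bigr)^2 - \frac{\Bigl( \sum_{i=1}^{k} p_i (1 - p_i) \log\frac{p_i}{1 - p_i} \Bigr)^2}{\sum_{i=1}^{k} p_i (1 - p_i)} \tag{B-3}$$

(van der Flier, 1982, pp. 295–296). As a consequence, it can be concluded that $L_{\mathbf{x}}$ (conditional on $X_+$) is asymptotically normally distributed with mean $\mu_L = \mu + \sum_{i=1}^{k} \log(1 - p_i)$ and variance $\sigma_L^2 = \sigma^2$ under the conditions previously stated. This result allows deriving an asymptotic distribution for $P(\mathbf{X} = \mathbf{x}) = \exp(L_{\mathbf{x}})$ conditional on $X_+$:

$$P\bigl(\exp(L_{\mathbf{x}}) \leq \rho_0\bigr) = P\bigl(L_{\mathbf{x}} \leq \log \rho_0\bigr) \approx \Phi\Bigl( \frac{\log \rho_0 - \mu_L}{\sigma_L} \Bigr), \tag{B-4}$$

with $\rho_0$ between 0 and 1, and $\Phi$ denoting the normal distribution function.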
We considered using the right-side expression of Equation (B-4) as an approximation for the PE of response vector x. The approximation did not work well for tests consisting of 20 items except in cases where all the p-values were very close to each other. Such cases are of limited interest in practical terms. It is also still unclear how many items are required for the method to produce reliable approximations of the PE for long tests. Answering this question, or looking for different alternatives that work well in a wider range of cases, should be the focus of future research.
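The split of the log-likelihood in Equation (B-1) into a random part and a constant can be worked out in a few lines. The sketch below assumes the product form of the pattern probability that the PE function in Appendix A uses, and the symbols $p_i$ (p-value of item $i$) and $k$ (number of items) are notation chosen here for illustration.

```latex
\begin{align*}
\log P(\mathbf{X} = \mathbf{x})
  &= \log \prod_{i=1}^{k} p_i^{x_i} (1 - p_i)^{1 - x_i} \\
  &= \sum_{i=1}^{k} \bigl[ x_i \log p_i + (1 - x_i) \log(1 - p_i) \bigr] \\
  &= \underbrace{\sum_{i=1}^{k} x_i \log\frac{p_i}{1 - p_i}}_{V_{\mathbf{x}}}
   + \underbrace{\sum_{i=1}^{k} \log(1 - p_i)}_{C}.
\end{align*}
```

Only the first sum involves the item scores $x_i$, so it carries all of the randomness. The second sum is the same for every response vector, which is why the normal approximation for $V_{\mathbf{x}}$ transfers to $L_{\mathbf{x}}$ with a shifted mean and an unchanged variance.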