An Investigation of the Performance of the Generalized S-X 2 Item-Fit Index for Polytomous IRT Models. Taehoon Kang Troy T. Chen

Transcription

1

2 An Investgaton of the Performance of the eneralzed S-X Item-Ft Index for Polytomous IRT Models Taehoon Kang Troy T. Chen

3

4 Abstract Orlando and Thssen (, 3) proposed an tem-ft ndex, S-X, for dchotomous tem response theory (IRT) models, whch has performed better than tradtonal tem-ft statstcs such as Yen s (98) Q and McKnley and Mll s (985). Ths study extends the utlty of S-X to polytomous IRT models, ncludng the generalzed partal credt model (PCM: Murak, 99), partal credt model (PCM: Masters, 98), and ratng scale model (RSM: Andrch, 978). The performance of the generalzed S-X n assessng tem-model ft was studed n terms of emprcal Type I error rates and power as compared to results obtaned for provded by the computer program PARSCALE (Murak & Bock, 997). The results show that the generalzed S-X s a promsng tem-ft ndex for polytomous tems n educatonal and psychologcal testng programs. Keywords: tem response theory; tem ft; S-X ; polytomous IRT; PARSCALE

5

6 An Investgaton of the Performance of the eneralzed S-X Item-Ft Index for Polytomous IRT Models Introducton Ever snce Lawley (943) and Lord (95) establshed the basc concepts of tem response theory (IRT), many IRT models have been developed and appled to varous felds of study such as psychologcal scalng and educatonal measurement. Unless an approprate IRT model for a gven data set s used, however, the benefts of IRT for applcatons such as test development, tem bankng, dfferental tem functonng (DIF), computerzed adaptve testng (CAT), and test equatng mght not be attaned. In bref, the success of IRT applcatons requres satsfactory ft between the model and the data. The most crtcal problem caused by model-data msft may be that the hallmark feature of IRT, parameter nvarance, no longer apples (Shepard, Camll & Wllams, 984; Bolt, ; Rupp & Zumbo, 4). Numerous statstcal procedures have been developed to evaluate tem ft under an IRT model, and goodness-of-ft studes have been conducted and reported n the volumnous IRT lterature (see Bock, 97; Douglas & Cohen, ; las & Suarez-Falcon, 3; Lang & Wells, 7; McKnley & Mlls, 985; Orlando & Thssen,, 3; Snharay, 3, 5; Stone, ; Stone & Zhang, 3; Suarez-Falcon & las, 3; Wells, 4; Yen, 98). Among them, several Ch-square based tem-level goodness-of-ft ndces usng sgnfcance tests such as Yen s (98) Q for dchotomous tems, the tradtonal loglkelhood Ch-square,, for both dchotomous and polytomous tems (McKnley & Mlls, 985), and Orlando and Thssen s (, 3) S-X for dchotomous tems have been utlzed for IRT applcatons. Type I error rates for these goodness-of-ft ndces have been nvestgated and reported.

7 A shortcomng of the tem-ft tests based on and s ther senstvty to test length and sample sze. For nstance, on a short test of dchotomous tems, these tradtonal statstcs exhbted nflated emprcal Type I error rates as hgh as.96 and.97, Q respectvely, for a gven nomnal rejecton rate of α =.5 (Orlando & Thssen, ). DeMars (5) smulaton studes wth a sample sze of, used PARSCALE s (Murak & Bock, 997) ft ndex whch s smlar to McKnley and Mll s (985), and dscovered emprcal Type I error rates of.4 under the partal credt model (PCM: Masters, 98) and.34 under the graded response model (RM: Samejma, 969) on a polytomous tem test. Stone and Hansen () found that an tem-ft test usng for a 3 polytomous tem test under the RM showed nflated emprcal Type I error rates between.4 and.8 for cases wth, examnees, and between.9 and.396 for cases wth, examnees, even though the true tem parameter values were used n calculatng predcted proportons. Besdes studes on and, there have been noteworthy studes on S-X by Q Orlando and Thssen (, 3). In ther smulaton studes usng test lengths of, 4, and 8 dchotomous tems and a sample sze of,, they showed that S-X adequately controlled Type I error rates (Orlando & Thssen, ). Under the -, -, and 3-parameter logstc models (PLM, PLM, and 3PLM, respectvely), the emprcal Type I error rates for tests based on S-X were found to be between.4 and.7 wth nomnal α of.5. Addtonally, the emprcal power of S-X mproved as sample sze ncreased from 5 to, (Orlando & Thssen, 3). The S-X ndex could also be generalzed and appled to the goodness-of-ft test for polytomous tems (Roberts, n press). The man purpose of ths study s to assess the performances of the generalzed S-X under the polytomous IRT models ncludng the generalzed partal credt model (PCM: Murak, 99), PCM, and the ratng scale model (RSM: Andrch, 978) for dfferent combnatons of test length and sample sze. The paper

8 3 begns wth a revew of,, and S-X statstcs followed by a dscusson on the Q generalzaton of S-X for polytomous tems. Fnally, the performances of the generalzed S-X and PARSCALE s are compared through a smulaton study. Ch-Square Based Item Ft Indces Accordng to Hambleton, Swamnathan and Rogers (99), and Stone (), a common strategy for assessng tem-ft of an IRT model can be summarzed as follows: () estmate the tem and ablty parameters under the chosen model, () classfy examnees nto K homogeneous groups n terms of ther ablty estmates, (3) calculate observed response proportons n each group for the tem under nvestgaton, (4) derve predcted response proportons n each group usng the tem and ablty parameter estmates under the IRT model of nterest, and (5) compute Ch-square based statstcs by comparng the observed and predcted values. The Tradtonal Ch-Square Based Ft Indces Q Both and are consdered to be tradtonal Ch-square based ft ndces. Yen s (98) Q for a dchotomous tem can be expressed as follows: ( O E ) ( O E ) Q =, () kz kz k k N k = N k k = z= Ekz k = Ek( Ek ) where z ndcates the tem score, k represents the number of groups of examnees, N k s the number of examnes n group k, and (= - ) and (= - ) are, respectvely, Ok Ok Ek E k the observed and predcted proportons of correct responses for group k. To compute, the tem and ablty parameter estmates are frst obtaned under the chosen IRT model, and the total number of groups (K) s set to wth each group havng approxmately an equal number of examnees. The predcted proportons of correct responses,, s computed as the mean predcted proporton of each group. Snce K =, the degrees of freedom (df) E k Q

9 assocated wth Q 4 equal m where m s the number of tem parameters estmated. For assessng model-tem ft for both dchotomous and polytomous tems, PARSCALE provdes as the goodness-of-ft ndex. ven an tem denoted, can be computed as follows: = K Z k= r kz z= ln N k r P kz z, () ( θ ) k where z ndcates tem scores rangng from zero to the hghest tem score of Z, k represents the number of groups of examnees, r kz equals the observed number of examnees scorng z n group k, Nk s total number of examnees n group k, and P θ ) s the response z ( k functon for tem score z evaluated at the mean ablty of examnees n group k. The total number of groups K could however vary across tems because neghborng groups can be collapsed to avod expected values, N k Pz ( θ k ), less than 5. In general, the df of for dchotomous tems equals K whch s dfferent from that of Q Yen s as no adjustment for the number of estmated parameters m s made for. Mslevy and Bock (99) argued that m s not consdered n determnng the df for because the parameter estmaton procedure has nothng to do wth mnmzng. Orlando and Thssen s S-X Index for Dchotomous Items Though Orlando and Thssen s () S-X procedure follows much the same pattern as the procedure, t has a notable advantage over Q and. For both Q and, Q the groupng procedure reles on sample- and model- dependent cut scores, whereas S-X s based on test scores (.e., number-correct scores). Usng the same notaton defned earler, S-X for a dchotomous tem on an I-tem test s gven by:

10 5 S-X = I I ( Okz Ekz ) ( Ok Ek) k = N k k= z= Ekz k= Ek( Ek) N. (3) The summaton over the test score k n equaton (3) excludes k = and k = I, snce the proporton of examnees who respond to an tem correctly and have a test score of zero s always zero, and t s always for those havng a perfect test score of I. Also, n equaton (3), the expected proporton of examnees n x = k group who got tem rght, Ek, could be calculated usng the followng formula: P ( θ ) f ( k θ ) φ( θ ) θ Ek =, (4) f ( k θ ) φ( θ ) θ where f ( k θ ) s the condtonal predcted test score ( x = k ) dstrbuton gven θ, f ( k θ ) represents the condtonal predcted test score dstrbuton wthout tem, and φ (θ ) s the populaton dstrbuton of θ. f ( k θ ) and f ( k θ ) could be computed wth the help of the recursve algorthm developed by Lord and Wngersky (984). Smlar to, neghborng groups could be collapsed to mantan a mnmum expected cell frequency of 5. For S-X, the mnmum value s set to accordng to the studes of Larntz (978) and Orlando and Thssen s (). If t s not necessary to collapse groups, then the df equals I m where m s the number of tem parameters estmated; otherwse an adjustment for the number of groups beng collapsed s needed. The eneralzed S-X Index for Polytomous Items The current study extends the applcaton of S-X to the assessment of tem ft for polytomous tems. For a polytomous tem denoted on a test of I polytomous tems wth each havng Z + scorng categores (.e., category score z =,,, Z ), the generalzed S-X can, usng notaton defned earler, be expressed as follows:

11 6 S-X = ( F Z ) Z k = Z z= N k ( O kz E E kz kz ), (5) where z ndcates category scores, Z s the hghest score of tem, and F s a perfect test I score (.e., F = Z ). The expected category proportons, Ekz, n equaton (5), can be computed usng the followng formula: P ( z θ ) f ( k z θ ) φ( θ ) θ Ekz =. (6) f ( k θ ) φ( θ ) θ To compute f ( z θ ) and f ( z θ ) n equaton (6), the generalzed recursve algorthm developed by Thssen, Pommerch, Blleaud and Wllams (995) can be used. Smlar to that for dchotomous tems, the computaton for the generalzed S-X procedure excludes groups wth a test score of zero (.e., k = ) or perfect test score (.e., k = F). Also, n equaton (5), the summaton for k s from Z through F Z. Ths s because wthn some groups wth extremely low or hgh test scores, the expected proportons of examnees ( ) for some categores are always zero. For nstance, for the group wth k = E kz 38 on a test of polytomous tems wth each havng 5 categores (.e., z =,,, 3, 4) n Table, the observed and predcted proportons of examnees for the z = and z = categores are always zero. Smlarly, for the group wth k =, they wll always be zero for the z = 3 and z = 4 categores. For the generalzed S-X procedure, such groups are collapsed to the groups wth k = Z or k = F Z. For example, the groups wth k =, and 3 for the llustratve tem n Table wll be combned wth the k = 4 group, and the groups wth k = 37, 38, and 39 wll be merged wth the k = 36 group. Hence, for an tem wth Z +

12 7 scorng categores, the total number of groups ncluded n the generalzed S-X procedure s I equal to K = F Z + = Z Z +.

13 8 TABLE The Observed and Expected Cell Frequences of an Illustratve Item Observed Frequences ( N O k kz ) Expected Frequences( N E k kz ) Test Item Category Score Item Category Score Score roup k ndcates that the cell has a value of zero always. ndcates that the expected value s less than.5 but larger than

14 9 In lght of Orlando and Thssen (), neghborng test score groups need to be collapsed to mantan a mnmum expected category frequency ( N E k kz ) of. However, ths collapsng method could be nfeasble for the case of polytomous tems. For example, f Orlando and Thssen s cell collapsng approach s appled to the llustratve tem n Table, only two groups wth k = 8 and 9 would reman, and too much nformaton would be lost. An even worse scenaro would be that only one group remans. Thus, nstead of collapsng test score groups, the current study suggests, f needed, collapsng adjacent cells of tem score categores for a gven group k to ensure a mnmum expected category frequency of. For nstance, n the group wth k = of the llustratve tem n Table, the z = 3 and z = 4 categores could be combned wth the z = category. Pett (997) mentoned that cell-collapsng should be undertaken carefully so that the combned cells make ntutve sense. Snce the tem score categores are ordered, the suggested method s consdered reasonable. Also, Murak (996) used ths approach for combnng response categores. Wth ths collapsng algorthm, the generalzed S-X has df of K Z m C where m s the number of tem parameters estmated, and C ndcates the total number of tem score categores beng collapsed. For the llustratve tem wth 5 scorng categores n Table, the df = = 66 under the PCM. Also, the generalzed S-X for ths tem was found to be 6. (p-value =.6). So, the PCM appeared to ft the tem adequately. Method Desgn of Smulaton Study To assess the performance of the generalzed S-X ndex, a smulaton study varyng polytomous IRT models, test lengths, and sample szes was conducted. The ndex selected for comparson was PARSCALE s because t appears to be the most frequently employed ndex n appled settngs. Three commonly used polytomous IRT models (RSM,

15 PCM and PCM) were explored n ths study. Under the PCM, the probablty that an examnee j scores z wth z =,,, Z on tem wth Z + response categores s modeled by the followng: z y= c= [ θ ( β τ )] exp α j c c= P( z θ j, α, β, τ c ) = Z, (7) y exp α [ θ ( β τ )] j c where α s the dscrmnaton of tem, β denotes the dffculty of tem, and τ c represents the locaton parameter for a category on tem. Equaton (7) needs to set τ, = Z c= τ = and exp α [ θ ( β τ ] = c c= j c ) for model dentfcaton. The three polytomous IRT models consdered n ths study are herarchcally related. If α n equaton (7) s fxed at across tems, equaton (7) reduces to the PCM. Moreover, f theτ values for each category are, respectvely, the same across tems, equaton (7) further reduces to the RSM. Consequently, RSM s nested wthn PCM, whle PCM s nested wthn PCM. In addton to the three generatng models (RSM, PCM, and PCM), ths smulaton study employed three test lengths (I = 5,, and tems), and four sample szes (N = 5,,,,, and 5, examnees). The three test lengths mmc tests havng small, moderate and large numbers of polytomously scored tems. The four sample szes represent small, moderate, large and very large samples. Data eneraton A standard method was employed for tem response generaton for ths study. The steps for data generaton nclude: () generate tem and ablty parameters, () under the chosen IRT model, calculate the probablty, P( z θ, α, β, τ ), for the responses usng the j c

16 generated tem and ablty parameters, and then the cumulatve probablty denoted P ( z θ, α, β, τ ) = j c z x= P( x θ, α, β, τ j c ), and (3) wth a random number denoted u drawn from the unform dstrbuton, U(,), assgn a response of f u P (), or z f P ( z ) < u P ( z) for z =,,, or Z. For ths study, the tem parameters used for smulatng response data under the PCM were obtaned as follows. The dscrmnaton parameters (α ) were randomly sampled from a lognormal (,.5 ) dstrbuton. For each tem, four tem step parameters (.e., δ c = β τ where c =,, 3, or 4) were randomly drawn from four normal dstrbutons c wth a common standard devaton of.5 and means of -.5, -.5,.5, and.5, respectvely. The mean of these four step parameters s then used as the tem dffculty parameter ( β ), and the dfference between β and δ c s taken as τ c. Ths tem-parameter generatng procedure was repeated for all I tems. The values for the θ parameter were randomly drawn from the standard normal dstrbuton, N(,). Wth these tem and ablty parameters, a response dataset under the PCM was generated. The generatng procedure for datasets under the PCM s the same as that for the PCM except that the α parameters were fxed at. For generatng the dataset under the RSM, only one random sample of each step parameters ( τ c ) was generated and used for all tems, whle the dscrmnaton parameters for each tem were also fxed at. Fnally, there were a total of 36 dfferent condtons smulated n ths study (3 generatng models 3 test lengths 4 sample szes). One hundred replcatons were generated for each condton, and each condton mmcked dfferent I-tem tests from the same tem pool admnstered to equvalent groups of n examnees, respectvely.

17 Item Parameter Calbraton The tem and ablty parameters n each smulated dataset were calbrated usng the computer program PARSCALE (Murak & Bock, 997). Examples of the PARSCALE codes used for PCM, PCM, and RSM calbraton are gven n Appendx A. Parallelng Orlando and Thssen s () study, the number of tem parameters of the calbraton model (CM) was always fewer than or equal to that of the generatng model (M). As shown n Table, when the CM and M were the same, results were used to evaluate the emprcal Type I error rates of the generalzed S-X and PARSCALE s. And, when the CM was a smpler model than the M, results were used to calculate the emprcal power of the tem-ft ndces. TABLE Model Calbraton Desgn n the Smulaton Study Calbratng Model (CM) eneratng Model (M) RSM PCM PCM RSM Type I Error Rate Power Power PCM - Type I Error Rate Power PCM - - Type I Error Rate Ths study used a nomnal α of.5. An tem was flagged for msft f the sgnfcance level (.e., p-value) for the observed ft ndex under nvestgaton was less than.5. Under each condton, the smulated Type I error rates or power for the ft ndces were obtaned by dvdng the number of flagged tems by I upon the completon of the replcatons.

18 3 Results When the CM was same as the M, the PARSCALE calbratons ran properly and converged successfully for all study condtons. However, when the CM s a smpler model than the M, the calbraton runs dd not converge successfully for some condtons wth EM cycles. The worst stuaton occurred for the condton wth tems and, examnees under CM = PCM and M = PCM. For ths condton, 4 out of data sets were not successfully calbrated. In the current study, the problematc cases were excluded from the calculatons of emprcal power. The proportons of tems wrongly flagged for msft are shown n Table 3 for each condton. At a glance, the smulated Type I error rates of the generalzed S-X dd not change much across condtons, whle the values for the PARSCALE s dffered drastcally. In the short (5-tem) and medum (-tem) test length condtons, the Type I error rates of appeared to be severely nflated n most cases. Ths s not surprsng because t s a well-known problem for the tradtonal tem-ft ndces as mentoned earler. Also, even though the test length s as long as tems, had nflated Type I error rates n the 5, examnee condtons (.3 for the RSM,.8 for the PCM, and.74 for the PCM). In contrast to the poor performance of, the generalzed S-X dd not seem to be affected by test length, and had emprcal Type I error rates rangng from.4 and.65 for all study condtons except those wth long (-tem) test length and small sample sze of 5 examnees. In the long (-tem) test condtons, the performance of S-X appeared to mprove consstently as the sample sze ncreased from 5 to,, but ths pattern was not observed for.

19 4 TABLE 3 Type I Error Rates: Proportons of Indces wth Sgnfcance Level reater Than.5, Under M = CM Test Length and eneralzed S-X PARSCALE Sample Sze RSM PCM PCM RSM PCM PCM , , , , , , , , , Table 4 shows the emprcal power of the generalzed S-X and when the M s more complex than the CM. As expected, the greater the dfference that exsts between M and CM, the hgher the power s. When CM was RSM, the generalzed S-X was found to be more senstve n detectng msft wth M = PCM than wth M = PCM. Smlarly, for the condtons wth M = PCM, the condtons wth CM = RSM yeld hgher power than those wth CM = PCM. For all combnatons of test length and sample sze, the hghest power was found when M = PCM and CM = RSM. These fndngs reflect the nested structure of the three IRT models. Although both the generalzed S-X and showed better power as the sample sze ncreased, the power values for were not very useful because of the nflated Type I error rates of ths ndex under most study condtons.

20 5 TABLE 4 Emprcal Power: Proportons of Indces wth Sgnfcance Level reater Than.5, Under M > CM eneralzed S-X PARSCALE Test Length and Sample Sze PCM > PCM > PCM > PCM > PCM > PCM > RSM RSM PCM RSM RSM PCM , , , , , , , , , Dscusson As the emprcal Type I error rates of were found to be napproprately nflated, ts performance s not dscussed further n ths secton. The dscusson presented here focuses on ssues related to the generalzed S-X ndex. Ths secton begns wth a scrutny on the df adjustment for the number of estmated tem parameters followed by the dscusson on the performance of the generalzed S-X. The df Adjustment for the Number of Estmated Item Parameters For the applcatons of tem-ft ndces, controverses over the df adjustment for the number of estmated tem parameters have been found n the IRT lterature. For nstance, the df of Yen s Q s adjusted for the number of estmated tem parameters, whle the df of PARSCALE s s not adjusted. DeMars (5) explaned that the dsagreement comes from the dfference n tem parameter estmaton methods. Yen s Q s desgned for tem-ft analyss when the jont maxmum lkelhood (JML) method s used. In contrast, PARSCALE employs the margnal maxmum lkelhood (MML). Mslevy and Bock (99) mentoned, for the MML approach, the resduals are not under lnear constrants and there s

21 6 no loss of degrees of freedom due to the fttng of the tem parameters (p.-). Even though Orlando and Thssen (, 3) used the MML method wth the computer program MULTILO (Thssen, 99), they appled the df adjustment. An argument for usng the df adjustment was gven by Stone and Zhang (3): Regardless of tem parameter estmaton approach, the df adjustment should be used to account for the fact that expected frequences are based on estmated tem parameters (p. 347). Applyng the df adjustment to ther studes on S-X for dchotomous tems, both Orlando and Thssen () and Stone and Zhang (3) found emprcal Type I error rates to be close to the nomnal rates. Based upon these dscussons, a prelmnary nvestgaton was conducted to compare the applcatons of the generalzed S-X wth and wthout the df adjustment. The results showed no dramatc dfference n the smulated Type I error rates and power even though the cases wthout the df adjustment were slghtly more conservatve. Addtonally, snce the generalzed S-X was derved on the bass of Orlando and Thssen s S-X, the df adjustment was utlzed for the current study. Performance of the eneralzed S-X Smlar to Orlando and Thssen s () study on S-X for dchotomous tems, the current study found that the generalzed S-X exhbted emprcal Type I error rates rangng from.4 to.65 for all study condtons except that wth test length of tems and number of examnees equal to 5. For ths condton, the Type I error rates of the generalzed S-X appeared to be nflated as hgh as.95 for the RSM,.98 for the PCM, and.6 the PCM, as shown n Table 3. Ths could be explaned by the nevtable sparseness n expected frequences. ven that there are more score categores and hence more total score groups on an I polytomous tem test than on an I dchotomous tem test, the applcatons of generalzed S-X would encounter more sparseness n expected frequences

22 7 for condtons wth long tests (e.g., tems) and small sample szes (e.g., 5). Another ssue related to control of Type I errors s how close the samplng dstrbuton of the tem-ft statstcs under the null hypothess s to the theoretcal dstrbuton. To study the extent to whch the null dstrbuton of the S-X ndex followed a Ch-square dstrbuton, Orlando and Thssen () examned the frst two moments of the ndex. For a Ch-square dstrbuton, the mean and varance equal the df and df, respectvely. Also, Stone () used the technque of Q-Q plots to compare the emprcal dstrbuton of the ft statstcs of to the expected Ch-square dstrbuton wth an estmated df. These two methods were employed for examnng the emprcal dstrbuton of the generalzed S-X ndex. It s beleved that the generalzed S-X has approxmately a Ch-square dstrbuton. However, for a gven test length and a chosen IRT model, the df would vary due to the total number of cells beng collapsed. For example, the observed df ranged from 5 to 47 for the replcates under the condton wth test length of 5 tems and sample sze of 5, n the Type I error rate analyss under the PCM. From the 5 smulated tems (5 tems replcates) under ths condton, the observed mode of the df was 4 wth a frequency count of 65. The mean and varance of these 65 emprcal S-X values were found to be 4. and 8.99, respectvely. For other condtons, the two moments were also found to be close to ther expected values. In addton, Fgure presents the Q-Q plot to compare the emprcal dstrbuton of the 65 S-X observatons to a theoretcal Ch-square dstrbuton wth df = 4. The plot has a slope and ntercept close to and, respectvely.

23 8 FIURE. Q-Q Plot of Emprcal S-X Dstrbuton Compared to a Ch-Square Dstrbuton wth df = Expected Ch-square Value Observed Value The results of power analyss n a smulaton study for tem-ft statstcs are closely related to how to generate msfttng tems under the non-null cases. As shown n Table 4, the more dfferent the M and CM were, the better the observed power analyss results. Therefore, the generalzed S-X had the hghest power when the model ft of the RSM was nvestgated for the data smulated under the PCM. Also, for these condtons, a moderate sample sze of, would yeld adequate power. When the PCM was the M and the PCM was the CM, however, a very large sample sze (e.g., 5, examnees) was requred to yeld acceptable power hgher than.7 regardless of the test length. Smlarly, when the PCM was the M and the RSM was the CM, sample szes of, or hgher are needed for the generalzed S-X ndex to produce satsfactory power.

24 9 Concluson Smlar to the fndngs reported n the lterature, the results of the current study showed that the performance of appeared to be poor n most condtons wth short and moderate test lengths or very large sample szes (e.g., 5, examnees). In contrast, for such tests wth 5 or polytomous tems, the generalzed S-X showed superor performance n terms of Type I error rate control and power. Smlar performance pattern of the ndex was also found for the cases wth very large sample szes regardless of test length. Consequently, the generalzed S-X s a promsng ndex for nvestgatng tem-ft n educatonal and psychologcal assessments. To gan a better understandng of ths promsng tem-ft ndex, addtonal studes need to be conducted. Frst, wth the desgn of ths smulaton study, all generated tems were consdered msft for the power study. Followng Orlando and Thssen s (3) study, the senstvty of the generalzed S-X n detectng dfferent percentages of msft tems on a test form needs to be further studed. Second, the number of test score groups and hence the df are determned by the test length and the numbers of tem score categores. All of the smulated tems n the current study had a fxed number of score categores. Thus, studes on the mpact due to dfferent numbers of tem score categores could provde some nsght on the behavor of the ndex. Thrd, t s not rare for ablty dstrbutons to be non-normal for an educatonal or psychologcal assessment (Mccer, 989). The performance of the generalzed S-X needs to be nvestgated under condtons where the ablty dstrbuton s not normal (e.g., unform or skewed). Fnally, noteworthy behavor of S-X has been found for dchotomous and polytomous tems separately, yet t mght behave dfferently for mxed format tests because of format effect or multdmensonalty. So, further studes on the S-X ndex for mxed format test data are requred.

25

26 References Andrch, D. (978). Applcaton of a psychometrc model to ordered categores whch are scored wth successve ntegers. Appled Psychologcal Measurement,, Bock, R. D. (97). Estmatng tem parameters and latent ablty when responses are scored n two or more nomnal categores. Psychometrka, 37, 9-5. Bolt, D. M. (). A Monte Carlo comparson of parametrc and nonparametrc polytomous DIF detecton methods. Appled Measurement n Educaton, 5, 3-4. DeMars, C. E. (5). Type I error rates for PARSCALE s ft ndex. Educatonal and Psychologcal Measurement, 65, 4-5. Douglas, J., & Cohen, A. S. (). Nonparametrc tem response functon estmaton for assessng parametrc model ft. Appled Psychologcal Measurement, 5, las, C. A. W., & Suarez-Falcon, J. C. (3). A comparson of tem-ft statstcs for the three-parameter logstc model. Appled Psychologcal Measurement, 7, Hambleton, R. K., Swamnathan, H., & Rogers, H. J. (99). Fundamentals of tem response theory. Newbury Park, CA:Sage. Larntz, K. (978). Small-sample comparsons of exact levels for Ch-squared goodness-of-ft statstcs. Journal of the Amercan Statstcal Assocaton, 73, Lawley, D.N. (943). On problems connected wth tem selecton and test constructon. Proceedngs of the Royal Socety of Ednburgh, 6A, Lang, T., & Wells, C. S. (7). A model ft statstc for generalzed partal credt model. Paper presented at the annual meetng of the Natonal Councl on Measurement n Educaton, Chcago, IL. Lord, F.M. (95). A theory of test scores. Psychometrc Monograph, No. 7. Lord, F. M., & Wngersky, M. S. (984). Comparson of IRT true-score and equpercentle observed-score equatongs". Appled Psychologcal Measurement, 8, Masters,. N. (98). A Rasch model for partal credt scorng. Psychometrka, 47, McKnley, R. L., & Mlls, C. N. (985). A comparson of several goodness-of-ft statstcs, Appled Psychologcal Measurement, 9, Mccer, T. (989). The uncorn, the normal curve and other mprobable creatures. Psychologcal Bulletn, 5:, Mslevy, R. J., & Bock, R. D. (99). BILO: Item analyss and test scorng wth bnary logstc models [computer software]. Chcago, IL: Scentfc Software. Murak, E. (99). A generalzed partal credt model: Applcaton of an EM algorthm. Appled Psychologcal Measurement, 6, Murak, E. (996). A generalzed partal credt model. In Van der Lnden, W. J., & Hambleton, R. K. (Eds.), Handbook of modern tem response theory. New York: Sprnger.

27 Murak, E. & Bock, R. D. (997). PARSCALE: IRT tem analyss and test scorng for ratngscale data [Computer software]. Chcago: Scentfc Software. Orlando, M., & Thssen, D. (). New tem ft ndces for dchotomous tem response theory models. Appled Psychologcal Measurement, 4, Orlando, M., & Thssen, D. (3). Further nvestgaton of the performance of S-X : An tem ft ndex for use wth dchotomous tem response theory models. Appled Psychologcal Measurement, 7, Pett, M. A. (997). Nonparametrc statstcs for health care research: Statstcs for small samples and unusual dstrbutons. Thousand Oaks, CA: Sage Publcaton, Inc. Roberts, J. S. (n press). Modfed lkelhood-based tem ft statstcs for the generalzed graded unfoldng model. Appled Psychologcal Measurement. Rupp, A. A., & Zumbo, B. D. (4). A note on how to quantfy and report whether IRT parameter nvarance holds: When Pearson correlatons are not enough. Educatonal and Psychologcal Measurement, 65, Samejma, F. (969). Estmaton of latent ablty usng a response pattern of graded scores. Psychometrka Monograph, 7. Shepard, L., Camll,., & Wllams, D. M. (984). Accountng for statstcal artfacts n tem bas research. Journal of Educatonal Statstcs, 9, Snharay, S. (3). Bayesan tem analyss for dchotomous tem response theory models (ETS RR-3-34). Retreved May, 5, from the ETS Web ste Snharay, S. (5). Bayesan tem ft analyss for undmensonal tem response models. Paper presented at the annual meetng of the Natonal Councl on Measurement n Educaton, Montreal, Canada. Stone, C. A. (). Monte Carlo based null dstrbuton for an alternatve goodness-of-ft test statstc n IRT models. Journal of Educatonal Measurement, 37, Stone, C. A., & Hansen, M. A. (). The effect of errors n estmatng ablty on goodnessof-ft tests for IRT models. Educatonal and Psychologcal Measurement, 6, Stone, C. A., & Zhang, B. (3). Assessng goodness of ft of tem response theory models: a comparson of tradtonal and alternatve procedures. Journal of Educatonal Measurement, 4, Suarez-Falcon, J. C., & las, C. A. W. (3). Evaluaton of global testng procedures for tem ft to the Rasch model. Brtsh Journal of Mathematcal and Statstcal Psychology, Thssen, D. (99). MULTILO: Multple, categorcal tem analyss and test scorng usng tem response theory. Chcago, IL: Scentfc Software. [Computer Program.] Thssen, D., Pommerch, M., Blleaud, K., & Wllams, V. (995). Item response theory for scores on tests ncludng polytomous tems wth ordered responses. Appled Psychologcal Measurement, 9, Wells, C. S. (4). Investgaton of model msft n IRT and a new approach based on smultaneous parametrc and nonparametrc IRT estmaton. Unpublshed doctoral dssertaton. Unversty of Wsconsn-Madson.

28 Yen, W. M. (98). Usng smulaton results to choose a latent trat model. Appled Psychologcal Measurement, 5,

29 4

30 5 Appendx A Examples of the PARSCALE Codes Used for PCM, PCM and RSM Calbraton < A-: PCM > >FILE DFNAME='ca.dat', SAVE; >SAVE PARM='gpgpca.par'; >INPUT NIDW=3, NTOTAL=, NTEST=, LENTH(), NFMT=; (3A,T,A) >TEST TNAME=gpca, ITEM=(()), NBLOCK=; >BLOCK BNAME=SBLOCK, NITEMS=, NCAT=5, SCORIN=(,,3,4,5), REPEAT = ; >CALIB PARTIAL, LOISTIC, CYCLES=(,,,), SCALE=., NQPTS=4, ITEMFIT=; >SCORE NOSCORE; < A-: PCM > >FILE DFNAME='ca.dat', SAVE; >SAVE PARM='pgpca.par'; >INPUT NIDW=3, NTOTAL=, NTEST=, LENTH(), NFMT=; (3A,T,A) >TEST TNAME=gpca, ITEM=(()), NBLOCK=; >BLOCK BNAME=SBLOCK, NITEMS=, NCAT=5, SCORIN=(,,3,4,5), REPEAT = ; >CALIB PARTIAL, LOISTIC, CYCLES=(,,,), SCALE=., NQPTS=4, ITEMFIT=,SPRIOR, PRIORREAD; >PRIORS SMU=(()), SSIMA=(.()); >SCORE NOSCORE; < A-3: RSM > >FILE DFNAME='ca.dat', SAVE; >SAVE PARM='rgpca.par'; >INPUT NIDW=3, NTOTAL=, NTEST=, LENTH(), NFMT=; (3A,T,A) >TEST TNAME=gpca, ITEM=(()), NBLOCK=; >BLOCK BNAME=SBLOCK, NITEMS=, NCAT=5, SCORIN=(,,3,4,5); >CALIB PARTIAL, LOISTIC, CYCLES=(,,,), SCALE=., NQPTS=4, ITEMFIT=,SPRIOR, PRIORREAD; >PRIORS SMU=(()), SSIMA=(.()); >SCORE NOSCORE;