A Bayesian Approach to Person Fit Analysis in Item Response Theory Models


Cees A. W. Glas and Rob R. Meijer, University of Twente, The Netherlands

A Bayesian approach to the evaluation of person fit in item response theory (IRT) models is presented. In a posterior predictive check, the observed value on a discrepancy variable is positioned in its posterior distribution. In a Bayesian framework, a Markov chain Monte Carlo procedure can be used to generate samples of the posterior distribution of the parameters of interest. These draws can also be used to compute the posterior predictive distribution of the discrepancy variable. The procedure is worked out in detail for the three-parameter normal ogive model, but it is also shown that the procedure can be directly generalized to many other IRT models. Type I error rate and the power against some specific model violations are evaluated using a number of simulation studies. Index terms: Bayesian statistics, item response theory, person fit, model fit, 3-parameter normal ogive model, posterior predictive check, power studies, Type I error.

Applications of item response theory (IRT) models to the analysis of test items, tests, and item score patterns are only valid if the IRT model holds. Fit of items can be investigated across persons, and fit of persons can be investigated across items. Item fit is important because in psychological and educational measurement, instruments are developed that are used in a population of persons; item fit can then help the test constructor to develop an instrument that fits an IRT model in that particular population. Item-fit statistics have been proposed by, for example, Mokken (1971), Andersen (1973), Yen (1981, 1984), Molenaar (1983), Glas (1988, 1999), and Orlando and Thissen (2000). As a next step, the fit of an individual's item score pattern can be investigated.
Although a test may fit an IRT model, persons may produce patterns that are unlikely given the model, for example, because they have preknowledge of the correct answers to some of the most difficult items. Investigation of person fit may help the researcher to obtain additional information about the answering behavior of a person. By means of a person-fit statistic, the fit of a score pattern can be determined given that the IRT model holds. Some statistics can be used to obtain information at the subtest level, so that a more diagnostic approach can be followed. Meijer and Sijtsma (1995, 2001) have given an overview of person-fit statistics proposed for various IRT models. To decide whether an item score pattern fits an IRT model, a sampling distribution under the null model, that is, the IRT model, is needed. Let t be the observed value of a person-fit statistic T. Then the significance probability, or probability of exceedance, is defined as the probability under the sampling distribution that the value of the test statistic is equal to or smaller than the observed value, that is, p = P(T ≤ t), or equal to or larger than the observed value, that is, p = P(T ≥ t), depending on whether low or high values of the statistic indicate aberrant item score patterns. As will be discussed below, for some statistics, theoretical asymptotic or exact distributions are known, which can be used to classify an item score pattern as fitting or nonfitting. An alternative is to simulate data according to an IRT model based on the estimated item parameters and then determine p empirically (e.g., Meijer & Nering, 1997; Reise, 1995, 2000; Reise & Widaman, 1999). However,

Applied Psychological Measurement, Vol. 27 No. 3, May 2003. Sage Publications.

Bayesian Person Fit Indices

$$P_j(\theta) = \gamma_j + (1 - \gamma_j)\,\Phi(\alpha_j\theta - \beta_j), \tag{1}$$

where Φ denotes the standard normal cumulative distribution function. To investigate the goodness of fit of item score patterns, several IRT-based person-fit statistics have been proposed. Most person-fit statistics have the form

$$V(\theta) = \sum_{j=1}^{k}\left[Y_j - P_j(\theta)\right]^2 v_j(\theta), \tag{2}$$

where Y_j is the response to item j, and where the weight v_j(θ) is often defined as an increasing function of the likelihood of the observed item scores. So the test is based on the discrepancy between the observed scores Y_j and the expected scores under the model, P_j(θ). A straightforward example of a member of the class defined by (2) is the W-statistic by Wright and Stone (1979), which is defined as

$$W = \frac{\sum_{j=1}^{k}\left[Y_j - P_j(\theta)\right]^2}{\sum_{j=1}^{k} P_j(\theta)\left[1 - P_j(\theta)\right]}. \tag{3}$$

A related statistic was proposed by Smith (1985, 1986), where the set of test items is divided into S nonoverlapping subtests denoted A_s (s = 1, ..., S). Then the unweighted between-sets fit statistic UB is defined as

$$UB = \frac{1}{S}\sum_{s=1}^{S}\frac{\left[\sum_{j\in A_s}\left[Y_j - P_j(\theta)\right]\right]^2}{\sum_{j\in A_s} P_j(\theta)\left[1 - P_j(\theta)\right]}. \tag{4}$$

Other obvious members of the class defined by (2) are two statistics proposed by Tatsuoka (1984): ζ_1 and ζ_2. The ζ_1-statistic is the standardization, with a mean of 0 and unit variance, of

$$\zeta_1 = \sum_{j=1}^{k}\left[P_j(\theta) - Y_j\right](n_j - \bar{n}), \tag{5}$$

where n_j denotes the number of correct answers to item j and n̄ denotes the mean number of correctly answered items in the test. The index will be positive, indicating misfitting response behavior, when easy items are incorrectly answered and difficult items are correctly answered, and it will also be positive if the number of correctly answered items deviates from the overall mean score of the respondents. If a response pattern is misfitting in both ways, the index will obtain a large positive value. The ζ_2-statistic is a standardization of

$$\zeta_2 = \sum_{j=1}^{k}\left[P_j(\theta) - Y_j\right]\left[P_j(\theta) - R/k\right], \tag{6}$$

where R is the person's number-correct score on the test.
This index is sensitive to item score patterns with correct answers to difficult items and incorrect answers to easy items; the overall response tendencies of the total sample of persons are not relevant here.
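As a concrete illustration, the statistics in Equations 3, 4, and 6 can be computed directly from one response pattern and the model probabilities P_j(θ). The sketch below assumes θ (or an estimate) is given; the function and variable names are illustrative, not from the paper. ζ_1 is omitted because it also requires the sample-level item scores n_j.

```python
import numpy as np

def person_fit_stats(y, p, subsets):
    """Person-fit discrepancies for one score pattern (Equations 3, 4, 6).

    y       : 0/1 item responses of one person (length k)
    p       : model probabilities P_j(theta) for that person (length k)
    subsets : list of item-index lists partitioning the test (for UB)
    """
    y, p = np.asarray(y, float), np.asarray(p, float)
    resid = y - p                 # observed minus expected score
    var = p * (1.0 - p)           # item response variance under the model

    # W (Wright & Stone, Equation 3): summed squared residuals over
    # summed item variances.
    W = np.sum(resid ** 2) / np.sum(var)

    # UB (Smith, Equation 4): mean over subtests of the squared summed
    # residual, each normalized by the subtest's summed variance.
    UB = np.mean([np.sum(resid[np.asarray(s)]) ** 2 /
                  np.sum(var[np.asarray(s)]) for s in subsets])

    # zeta_2 (Tatsuoka, Equation 6, unstandardized): residuals weighted by
    # the difference between P_j(theta) and the proportion-correct R/k.
    zeta2 = np.sum((p - y) * (p - y.sum() / len(y)))
    return W, UB, zeta2
```

Large values of W and UB, and large positive values of ζ_2, would flag a pattern for closer inspection.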

Another well-known person-fit statistic is the log-likelihood statistic

$$l = \sum_{j=1}^{k}\left\{Y_j \log P_j(\theta) + (1 - Y_j)\log\left[1 - P_j(\theta)\right]\right\}, \tag{7}$$

first proposed by Levine and Rubin (1979). It was further developed in Drasgow, Levine, and Williams (1985) and Drasgow, Levine, and McLaughlin (1991). Drasgow et al. (1985) proposed a standardized version l_z of l that was purported to be asymptotically standard normally distributed; l_z is defined as

$$l_z = \frac{l - E(l)}{\left[\operatorname{Var}(l)\right]^{1/2}}, \tag{8}$$

where E(l) and Var(l) denote the expectation and the variance of l, respectively. These quantities are given by

$$E(l) = \sum_{j=1}^{k}\left\{P_j(\theta)\log\left[P_j(\theta)\right] + \left[1 - P_j(\theta)\right]\log\left[1 - P_j(\theta)\right]\right\} \tag{9}$$

and

$$\operatorname{Var}(l) = \sum_{j=1}^{k} P_j(\theta)\left[1 - P_j(\theta)\right]\left[\log\frac{P_j(\theta)}{1 - P_j(\theta)}\right]^2. \tag{10}$$

It can easily be shown that l − E(l) can be written in the form of Equation 2 by choosing

$$v_j(\theta) = \log\left(\frac{P_j(\theta)}{1 - P_j(\theta)}\right). \tag{11}$$

The assessment of person fit using statistics based on V(θ) (Equation 2) is usually contaminated by the estimation of θ. If θ̂ is an estimate of θ, then the distributions of the person-fit statistics using V(θ̂) and V(θ) will differ. For example, Molenaar and Hoijtink (1990) showed that the distribution of l_z differs substantially from the standard normal distribution for short tests. Snijders (2001) derived expressions for the first two moments of the distribution, E[V(θ̂)] and Var[V(θ̂)], to define a correction factor. Note that this correction factor applies to all the statistics based on V(θ) defined above. With respect to the special case of l_z, Snijders performed a simulation study for relatively small tests consisting of 8 and 15 items and for large tests consisting of 50 and 100 items, fitting the 2PL model and estimating θ by maximum likelihood. The results showed that the correction was satisfactory at Type I error levels of α = .05 and α = .10 but that the empirical Type I error was smaller than the nominal Type I error for smaller values of α.
In fact, both the distribution of l_z and the version of l_z corrected for θ̂, denoted l_z*, are negatively skewed (Snijders, 2001; van Krimpen-Stoop & Meijer, 1999). This skewness influences the difference between nominal and empirical Type I error rates for small Type I error values. For example, Snijders (see also van Krimpen-Stoop & Meijer, 1999) found that for a 50-item test at α = .05, the discrepancy between the nominal and the empirical Type I error for l_z and l_z* at θ = 0 was small (.001), whereas for α = .001, for both statistics, it was larger (approximately .005). van Krimpen-Stoop and Meijer (1999) found that increasing the item discrimination resulted in a distribution that was more negatively skewed. An alternative may be to use a χ²-distribution; however, statistical theory that incorporates the skewness of the distribution is not yet available for the 2PL and 3PL models.
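The computation of l and l_z from Equations 7 through 10 is straightforward; a minimal sketch, again assuming the probabilities P_j(θ) are evaluated at a known or estimated θ (the function name is illustrative):

```python
import numpy as np

def log_likelihood_fit(y, p):
    """The statistic l (Equation 7) and its standardization l_z
    (Equations 8-10) for one item score pattern."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    l = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))        # Eq. 7
    e_l = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))      # Eq. 9
    logit = np.log(p / (1 - p))
    var_l = np.sum(p * (1 - p) * logit ** 2)                   # Eq. 10
    return l, (l - e_l) / np.sqrt(var_l)                       # Eq. 8
```

Note that this plugs a point value of θ into the formulas, so it inherits exactly the contamination problem discussed above; the Bayesian procedure in this paper sidesteps it by averaging over draws of θ.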

Examples of person-fit tests outside the class defined by (2) are the uniformly most powerful (UMP) tests by Klauer (1991, 1995; see also Levine & Drasgow, 1988). Klauer's approach entails a UMP test of whether a person's item score pattern complies with the Rasch (1960) model against a specific alternative model that can be viewed as a generalization of the Rasch model. The statistic that forms the basis of a UMP test is the sufficient statistic for the parameters that have to be added to the null model (in this case, the Rasch model) to define the alternative model. For example, consider the test of the Rasch model against an alternative model where the ability parameter differs between subtest A_1 and subtest A_2. Let θ_1 be the individual's ability on subtest A_1, let θ_2 be the individual's ability on subtest A_2, and let δ = θ_1 − θ_2. Furthermore, consider the number-correct scores on the first and second subtests. Then H_0: δ = 0 can be tested against H_1: δ ≠ 0 using the number-correct score on either one of the subtests. Note that in an IRT model, it is assumed that for each person the latent trait is invariant across items; if this is not the case, this may point to aberrant response behavior. In contrast to, for example, calculating the log-likelihood as given in Equation 7 or 8, we now explicitly test against an alternative hypothesis. So when the null hypothesis is rejected for a particular person, this person can be classified as aberrant. The statistical test of whether the total score on the first subtest is too high compared to the model is denoted T_1, and the test of whether the score on the second subtest is too high is denoted T_2.
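Because the sufficient statistic for δ is simply the number-correct score on a subtest, the quantities underlying T_1 and T_2 are trivial to compute; a minimal sketch (names are illustrative, not from the paper):

```python
import numpy as np

def subtest_scores(y, subtest_1):
    """Number-correct scores on subtests A1 and A2, the sufficient
    statistics behind the UMP tests T_1 and T_2. `subtest_1` holds the
    item indices of A1; A2 is its complement."""
    y = np.asarray(y, int)
    in_a1 = np.zeros(len(y), dtype=bool)
    in_a1[subtest_1] = True
    return int(y[in_a1].sum()), int(y[~in_a1].sum())
```

The test itself then positions the observed subtest score in its distribution under the null model; in the Bayesian generalization studied below, that distribution is the posterior predictive distribution.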
As another example, Klauer (1991, 1995) proposed a person-fit test for violation of the assumption of local independence using an alternative model proposed by Kelderman (1984; see also Jannarone, 1986), where the probability of a response pattern (y_1, ..., y_j, ..., y_k) is given by

$$P(y_1, \ldots, y_j, \ldots, y_k \mid \theta) \propto \exp\left[\sum_{j=1}^{k} y_j(\theta - \beta_j) + \sum_{j=1}^{k-1} y_j y_{j+1}\,\delta\right]. \tag{12}$$

Note that δ models the dependency between y_j and y_{j+1}. If δ = 0, the model equals the Rasch model. A UMP test, denoted T_lag, of the null hypothesis δ = 0 can be based on a sufficient statistic with realizations $\sum_{j=1}^{k-1} y_j y_{j+1}$. When the item discrimination parameters α_j are considered known, the principle of the UMP test can also be applied to the 2PL model. Analogous UMP tests for the 3PL model and the normal ogive model cannot be derived because these models have no sufficient statistic for θ. Even though UMP tests do not exist for these models, the notion of using statistics related to the parameters of an alternative model as the basis of a test is intuitively appealing. Therefore, the generalizations of these tests to the 3PNO model in a Bayesian framework will also be studied below.

Bayesian Estimation of the 3PNO Model

In this study, an MCMC procedure will be used to generate the posterior distributions of interest. The MCMC chains will be constructed using the Gibbs sampler (Gelfand & Smith, 1990). To implement the Gibbs sampler, the parameter vector is divided into a number of components, and each successive component is sampled from its conditional distribution given sampled values for all other components. This sampling scheme is repeated until the sampled values form stable posterior distributions. Albert (1992; see also Baker, 1998) applied Gibbs sampling to estimate the parameters of the well-known 2PNO model (e.g., Lord & Novick, 1968). Johnson and Albert (1999, sec. 6.9) generalized the procedure to the 3PNO.
For application of the Gibbs sampler, it is important to create a set of partial posterior distributions that are easy to sample from. This often involves data augmentation, that is, the introduction of additional latent variables that lead to a simple set of

posterior distributions. In the Gibbs sampling algorithm, these latent variables are sampled along with the variables of interest. The present procedure is based on two data augmentation steps. The first step entails the introduction of binary variables W_ij such that

$$W_{ij} = \begin{cases} 1 & \text{if person } i \text{ knows the correct answer to item } j, \\ 0 & \text{if person } i \text{ does not know the correct answer to item } j. \end{cases} \tag{13}$$

So if W_ij = 0, person i guessed the response to item j; if W_ij = 1, person i knows the right answer and gives a correct response. The relation between W_ij and the observed response variable Y_ij is given by a model where Φ(η_ij), with η_ij = α_j θ_i − β_j, is the probability that the respondent knows the item and thus gives a correct response with probability one, and 1 − Φ(η_ij) is the probability that the respondent does not know the item and guesses, with γ_j as the probability of a correct response. So the probability of a correct response is the sum of the terms Φ(η_ij) and γ_j [1 − Φ(η_ij)]. Summing up, we have

$$\begin{aligned} P(W_{ij} = 1 \mid Y_{ij} = 1, \eta_{ij}, \gamma_j) &\propto \Phi(\eta_{ij}) \\ P(W_{ij} = 0 \mid Y_{ij} = 1, \eta_{ij}, \gamma_j) &\propto \gamma_j\left[1 - \Phi(\eta_{ij})\right] \\ P(W_{ij} = 1 \mid Y_{ij} = 0, \eta_{ij}, \gamma_j) &= 0 \\ P(W_{ij} = 0 \mid Y_{ij} = 0, \eta_{ij}, \gamma_j) &= 1. \end{aligned} \tag{14}$$

The second data augmentation step is derived using a rationale analogous to one often used as a justification of the 2PNO (see, for instance, Lord, 1980, sec. 3.2). In that rationale, it is assumed that if person i is presented item j, a latent variable Z_ij is drawn from a normal distribution with mean η_ij and a variance equal to one. A correct response y_ij = 1 is given when the drawn value is positive. Analogously, in the present case, a variable Z_ij is introduced with a distribution defined by

$$Z_{ij} \mid W_{ij} = w_{ij} \sim \begin{cases} N(\eta_{ij}, 1) \text{ truncated at the left by } 0 & \text{if } w_{ij} = 1, \\ N(\eta_{ij}, 1) \text{ truncated at the right by } 0 & \text{if } w_{ij} = 0. \end{cases} \tag{15}$$

The item parameters have the prior $p(\alpha, \beta) \propto \prod_{j=1}^{k} I(\alpha_j > 0)$, which ensures that the discrimination parameters are positive.
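A sketch of the two augmentation draws in Equations 14 and 15, assuming scipy is available (function names are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import norm, truncnorm

def draw_w(y, eta, gamma, rng):
    """One draw of the mastery indicators W_ij (Equation 14). For
    y_ij = 1, W_ij = 1 with probability
    Phi(eta_ij) / [Phi(eta_ij) + gamma_j (1 - Phi(eta_ij))];
    for y_ij = 0, W_ij = 0 with probability one."""
    phi = norm.cdf(eta)
    p1 = phi / (phi + gamma * (1.0 - phi))
    return np.where(y == 1, rng.random(np.shape(eta)) < p1, 0).astype(int)

def draw_z(w, eta, rng):
    """One draw of the latent responses Z_ij (Equation 15): N(eta_ij, 1)
    truncated to the positive axis if w_ij = 1 and to the negative axis
    if w_ij = 0. Bounds are passed to truncnorm in standardized form,
    i.e., relative to loc and scale."""
    lo = np.where(w == 1, -eta, -np.inf)
    hi = np.where(w == 1, np.inf, -eta)
    return truncnorm.rvs(lo, hi, loc=eta, scale=1.0, random_state=rng)
```

These two draws constitute Step 1 of the sampler described below; the remaining steps reduce to standard normal linear regression and beta updates.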
Note that this prior is uninformative with respect to β. The guessing parameter γ_j has the conjugate prior Beta(a, b). The ability parameters θ have a standard normal prior, that is, µ = 0 and σ = 1. The procedure described below is based on the Gibbs sampler. The aim of the procedure is to simulate samples from the joint posterior distribution of α, β, γ, θ, z, and w, given the data y, which are the responses of n test takers to k items. This posterior distribution is given by

$$\begin{aligned} p(\alpha, \beta, \gamma, \theta, z, w \mid y) &= p(z, w \mid y; \alpha, \beta, \gamma, \theta)\, p(\gamma)\, p(\alpha, \beta)\, p(\theta) \\ &= C \prod_{i=1}^{n} \left[ \prod_{j=1}^{k} p(z_{ij} \mid w_{ij}, \eta_{ij})\, p(w_{ij} \mid y_{ij}, \eta_{ij}, \gamma_j)\, p(\gamma_j)\, I(\alpha_j > 0) \right] \phi(\theta_i;\ \mu = 0, \sigma = 1), \end{aligned} \tag{16}$$

where p(w_ij | y_ij, η_ij, γ_j) is given by Equation 14 and p(z_ij | w_ij, η_ij) follows from Equation 15. Although the distribution given by Equation 16 has an intractable form, as a result of the two data augmentation steps, the conditional distributions of α, β, γ, θ, z, and w are now each tractable and

easy to sample from. A draw from the full conditional distribution can be obtained in the following steps.

Step 1. The posterior p(z, w | y; α, β, γ, θ) is factored as p(z | y; w, α, β, γ, θ) p(w | y; α, β, γ, θ), and values of w and z are drawn in two substeps: first, draw w_ij from the distribution of W_ij conditional on the data y and α, β, γ, θ, given by Equation 14; then, draw z_ij from the conditional distribution of Z_ij given all other variables, using Equation 15.

Step 2. Draw from the conditional distribution of θ given the values z, w, α, β, γ, and y. Since p(θ | y; z, w, α, β, γ) is proportional to p(z | θ; α, β) p(θ) p(γ, w, y | z, α, β), and the last term does not depend on θ, it follows from the definition of Z_ij given above that the error term ε_ij in Z_ij + β_j = α_j θ_i + ε_ij is normally distributed. So the full conditional distribution of θ entails a normal model for the regression of Z_ij + β_j on α_j, with θ_i as a regression coefficient that has a normal prior with parameters µ = 0 and σ = 1 (see, for instance, Gelman et al., 1995, pp. 45, 78).

Step 3. Draw from the conditional distribution of the parameters of item j, α_j and β_j. Analogous to the previous step, this step also entails sampling from a regular normal linear model. Defining Z_j = (Z_1j, ..., Z_nj)^T and X = (θ, −1), with −1 being the n-dimensional column vector with all elements equal to −1, the two item parameters can be viewed as regression coefficients in Z_j = X(α_j, β_j)^T + ε, where ε is a vector of random errors. So this step also boils down to sampling the regression coefficients in a regular Bayesian linear regression problem.

Step 4. Sample from the conditional distribution of γ_j. The likelihood of w_1j, ..., w_nj is binomial with parameter γ_j.
With the noninformative conjugate Beta prior introduced above, the posterior distribution of γ_j also follows a beta distribution (see, for instance, Gelman et al., 1995, sec. 2.1).

So the procedure boils down to iteratively generating a number of sequences of parameter values using these four steps. Convergence can be evaluated by comparing the between- and within-sequence variances (see, for instance, Gelman et al., 1995). Starting points of the sequences can be provided by the Bayes modal estimates of BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996). For more information on this algorithm, refer to Albert (1992), Baker (1998), and Johnson and Albert (1999).

In the Bayesian approach, the posterior distribution of the parameters of the 3PNO model, say p(ξ | y), is simulated using an MCMC method proposed by Johnson and Albert (1999). Person fit will be evaluated using a posterior predictive check based on an index T(y, ξ). When the Markov chain has converged, draws from the posterior distribution can be used to generate model-conform data y^rep and to compute a so-called Bayesian p value defined by

$$\Pr\left(T(y^{rep}, \xi) \geq T(y, \xi) \mid y\right). \tag{17}$$

So person fit is evaluated by computing the relative proportion of replications, that is, draws of ξ from p(ξ | y), in which the person-fit index computed using the data, T(y, ξ), has a smaller value than the analogous index computed using data generated to conform to the IRT model, T(y^rep, ξ). Posterior predictive checks are performed by inserting the person-fit statistics given in the previous section into Equation 17. After the burn-in period, when the Markov chain has converged, in every nth iteration (n ≥ 1), using the current draw of the item and person parameters, a person-fit index T(y, ξ) is computed, a new model-conform response pattern is generated, and a

value T(y^rep, ξ) is computed. Finally, a Bayesian p value is computed as the proportion of iterations where T(y^rep, ξ) ≥ T(y, ξ).

Simulation Studies

The simulation study consisted of two parts. In the first part, the Type I error rate as a function of test length and sample size was investigated. In the second part, the detection rates of the different statistics for different model violations, test lengths, and sample sizes were investigated. The impact of nonfitting item scores on the bias in θ as a function of the number of test items affected by lack of model fit was also investigated. In all reported simulation studies, the statistics l, W, UB, ζ_1, ζ_2, T_lag, T_1, and T_2 are used as defined above.

Study 1: Type I Error Rate

Method

The simulation studies with respect to the Type I error rate were performed in two conditions. In the first, the data were generated using randomly drawn item parameters; in the second, the data were generated using fixed item parameters. In both conditions, the ability parameters were drawn from a standard normal distribution. In the first condition, for every replication, the item parameters were drawn from the default prior distributions used in BILOG-MG. The guessing parameter γ was drawn from a Beta(a, b) distribution with a and b equal to 5 and 17, respectively; this resulted in a mean γ of .20. Furthermore, the item discrimination parameters were drawn from a log-normal distribution with mean 0 and a variance equal to .50, and the item difficulty parameters β were drawn from a normal distribution, also with mean 0 and variance .50. In the second condition, the item parameters were fixed. The γ was fixed to .20 for all items. Item difficulty and discrimination parameters were chosen as follows: for a test length k = 30, three values of the discrimination parameter, 0.5, 1.0, and 1.5, were crossed with 10 item difficulties β_i = (i − 1), i = 1, ..., 10;
for a test length k = 60, three values of the discrimination parameter, 0.5, 1.0, and 1.5, were crossed with 20 item difficulties β_i = (i − 1), i = 1, ..., 20. Three sample sizes were used: n = 100, n = 400, and n = 1,000. The true values of the parameters were used as starting values for the MCMC procedure. The procedure had a run length of 4,000 iterations with a burn-in period of 1,000 iterations; that is, the first 1,000 iterations were discarded. In the remaining 3,000 iterations, T(y^rep, ξ) and T(y, ξ) were computed every 5 iterations, so the posterior predictive checks were based on 600 draws. For the statistics that use a partitioning of the items into subtests, the items were ordered according to their item difficulty β, and then two subtests of equal size were formed, one with the difficult and one with the easy items. Finally, for every condition, 100 replications were simulated, and the proportion of replications with a Bayesian p value less than .05 was determined.

Results

The results for the condition with random item parameters are shown in Table 1; the results for the condition with fixed item parameters are shown in Table 2. It can be seen that, in general, the significance probabilities converge to their nominal value of .05 as a function of sample size and test length, and the nominal significance probability was best approximated by the combination of a test length k = 60 and a sample size n = 400 or n = 1,000. Note that for n = 100, the significance

Table 1
Actual Type I Error Rates for a Nominal α = .05 Test: Random Item Parameters
[Entries for l, W, UB, ζ_1, ζ_2, T_lag, T_1, T_2, and MAE, for k = 30 and k = 60 crossed with n = 100, n = 400, and n = 1,000, are not reproduced here.]
Note. MAE = mean absolute error.

probabilities were much too large. There are no clear effects for specific person-fit statistics, except that UB and T_lag seemed to be quite conservative for n = 400 and n = 1,000 under random item parameter selection. Finally, at the bottom of the two tables, the mean over replications and simulees of the absolute difference between the true and the estimated ability parameters is given. This mean absolute error (MAE) was used to interpret the bias in the ability parameters of simulees with nonfitting response vectors in the simulation study.

Table 2
Actual Type I Error Rates for a Nominal α = .05 Test: Fixed Item Parameters
[Entries not reproduced here; same layout as Table 1.]
Note. MAE = mean absolute error.

Study 2: Detection Rates

Method

Guessing. In several studies, the problems posed by unmotivated persons who take a test in which they have no personal interest have been discussed. For example, Schmitt, Cortina, and Whitney (1993) noted the potential for suspicion and disdain among employees in a concurrent validation study; the authors predicted that factors including poor motivation and cheating may lead to inaccurate assessment of abilities for some employees. In such testing conditions, persons may guess the

in particular, item preknowledge of the items of median difficulty and of the most difficult items may have an effect on the total score. Thus, the effect of item preknowledge will be important in particular for persons with a low ability level who answer many difficult items correctly. The setup of the simulation study of the detection rate of the tests for item disclosure was analogous to the study of the detection rate for guessing. So data were generated for sample sizes of n = 400 and n = 1,000 simulees and test lengths of k = 30 and k = 60 items; item disclosure was present for 10% of the simulees; and for these simulees, 1/6, 1/3, or 1/2 of the difficult items in the test were corrupted. The probability of a correct response to these items was chosen to be .80. Test statistics were computed in the same way as in the guessing study, except for T_1 and T_2, which were now right-tailed, as were all other statistics in the study. That is, both statistics were designed to detect scores that were too high. Again, 50 replications were simulated in every condition.

Violations of local independence. When previous items provide new insights useful for answering the next item, or when the process of answering items is exhausting, the assumption of local independence may be violated. This may result, for example, from speeded testing situations or from situations where there is exposure to material among students (Yen, 1993; see also Embretson & Reise, 2000). The setup of the simulation study of the detection rate of the tests for violation of local independence was analogous to the studies of the detection rates for guessing and item disclosure. Again, data were generated for sample sizes of n = 400 and n = 1,000 simulees and test lengths of k = 30 and k = 60 items; the model violation was imposed on 10% of the simulees; and for these simulees, 1/6, 1/3, or 1/2 of the test was corrupted.
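For a small number of items, locally dependent response patterns of this kind can be drawn from the Kelderman-type model in Equation 12 by brute-force enumeration; the sketch below is illustrative only, as the paper does not describe its generation algorithm:

```python
import itertools
import numpy as np

def sample_dependent_pattern(theta, beta, delta, rng):
    """Draw one item score pattern from the locally dependent model in
    Equation 12 by enumerating all 2^k patterns and normalizing the
    unnormalized probabilities. Practical only for small k."""
    beta = np.asarray(beta, float)
    k = len(beta)
    patterns = np.array(list(itertools.product([0, 1], repeat=k)))
    # log of Equation 12: sum_j y_j (theta - beta_j) + delta * sum_j y_j y_{j+1}
    logp = patterns @ (theta - beta) + delta * np.sum(
        patterns[:, :-1] * patterns[:, 1:], axis=1)
    prob = np.exp(logp - logp.max())
    prob /= prob.sum()
    return patterns[rng.choice(len(patterns), p=prob)]

def lag_statistic(y):
    """Realization of the sufficient statistic for T_lag."""
    y = np.asarray(y, int)
    return int(np.sum(y[:-1] * y[1:]))
```

With δ > 0, patterns with runs of adjacent correct responses receive extra probability mass, which is exactly what the lag statistic is designed to pick up.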
Responses to corrupted items were generated with the model defined by (12), with δ = 1.0. In these simulations, the items were ordered such that the affected items succeeded each other. For the condition where 1/3 of the test was corrupted, the model violation was imposed on the items with α = 1.0. For the condition where 1/6 of the test was corrupted, the model violation was imposed on the items with α = 1.0 and the lowest item difficulties. For the condition where 1/2 of the test was corrupted, the model violation was imposed on the items with α = 1.0 and, where there were too few items with α = 1.0, on the items with α = .5 and the lowest item difficulties. The impact of a violation with δ = 1.0 was an average increase in the lag statistic, $\sum_{j=1}^{k-1} y_j y_{j+1}$, of 1.6, 4.0, and 5.9 for a test of 60 items with 1/6, 1/3, or 1/2 of the items corrupted, respectively, and of 0.7, 1.5, and 2.4 for a test of 30 items with 1/6, 1/3, or 1/2 of the items corrupted, respectively. Test statistics were computed in the same way as in the study of item disclosure reported above, with the exception that for T_1 the focus was on higher-than-expected outcomes. Again, 50 replications were made in every condition.

Results

Guessing. The proportions of hits, that is, the proportions of correctly identified aberrant simulees, are shown in Table 3. The proportions of false alarms, that is, the proportions of normal simulees incorrectly identified as aberrant, are shown in Table 4. The optimal condition for the detection of guessing was a large sample size and a large test length. Therefore, the results of the condition with n = 1,000 simulees and k = 60 items are discussed first. The main overall trend for all tests was that the detection rate decreased as the number of affected items increased. This can be explained by the inflated MAE of θ for the misfitting simulees (bottom of Table 3).
It can be seen that the MAE for the misfitting simulees was grossly inflated; the MAE was larger for p = 1/2 and p = 1/3 than for p = 1/6. Comparing these results with the results in Table 1, it can also be concluded that the presence of 10% misfitting simulees in the calibration sample affected the MAE for the fitting simulees to some degree. As the number of affected items

Table 3
Detection Rate for Guessing Simulees
[Entries for l, W, UB, ζ_1, ζ_2, T_lag, T_1, T_2, MAE normal, and MAE aberrant, for k = 30 and k = 60 crossed with n = 400 and n = 1,000 and p = 1/6, 1/3, and 1/2, are not reproduced here.]
Note. MAE = mean absolute error.

Table 4
False Alarm Rate for Guessing Simulees
[Entries not reproduced here; same layout as Table 3, without the MAE rows.]

increased, the MAE also increased, and because the fit statistics were computed conditionally on θ̂, the detection rate decreased. Inspection of the results in the condition with n = 400 simulees and k = 60 shows that the detection rate was little affected by the smaller calibration sample. Furthermore, note that the detection rate of T_1 was lower than the detection rate of T_2. So the bias in θ̂ was such that the low scores on the first part of the test were less unexpected than the relatively high scores on the second part of the test. For a test length of k = 30 items, the detection rate was slightly lower than for k = 60 items. This was as expected, because the statistics were computed at the individual level, and at this level, the test length is the number of observations on which the test is based. Note that the relatively low detection rates of T_lag and T_1 found for k = 60 also applied for k = 30. Finally, at the bottom of the table, it can be seen that the MAE was less inflated than in the study with k = 60. The explanation is that the absolute number of affected items responded to was lower here.

Item disclosure. The proportions of hits and false alarms are shown in Tables 5 and 6, respectively. It can be concluded that the effects of test length and proportion of affected items are also found here.
Furthermore, the absence of an effect of calibration sample size was replicated. The detection rates of T_lag were relatively low.

Table 5
Detection Rate for Item Disclosure
[Entries for l, W, UB, ζ_1, ζ_2, T_lag, T_1, T_2, MAE normal, and MAE aberrant, for k = 30 and k = 60 crossed with n = 400 and n = 1,000 and p = 1/6, 1/3, and 1/2, are not reproduced here.]
Note. MAE = mean absolute error.

Table 6
False Alarm Rate for Item Disclosure
[Entries not reproduced here; same layout as Table 5, without the MAE rows.]

Remember that now the items in the second part of the test, that is, the difficult items, were affected by the model violations. Therefore, it was expected that T_2 would be sensitive to the increase in the total score on the second half of the test. Table 5 shows that this expectation was confirmed (e.g., detection rates between .24 and .50 for n = 1,000 and k = 30). However, note that for k = 30, the detection rate of T_1 was also relatively large (between .27 and .30) with, contrary to T_2, a high false alarm rate (between .27 and .29). Thus, the bias in the ability estimate caused by the model violation was large enough to affect the simulation of the predictive distribution of T_1. In practice, this is undesirable, because one does not know a priori which part of the test is affected, and the interpretation of the outcomes of T_1 and T_2 is then problematic.

Violations of local independence. The proportions of hits and false alarms are shown in Tables 7 and 8, respectively. Although for the aberrant simulees the increase in the lag statistic reported above was considerable, the two bottom lines of Table 7 show that the resulting bias in their ability estimates was not impressive. Also, the detection rate of the tests was negligible. Even the power of the T_lag-test, which was specially targeted at this model violation, was negligible. The only exception was the T_1-test.
The reason was that the affected items were placed at the beginning of the test, and the increase in lag also resulted in an increase in the total score on the first part of the test. Note, however, that Table 8 shows that the false alarm rate of this test also increased.

Volume 27 Number 3 May, APPLIED PSYCHOLOGICAL MEASUREMENT

[Table 7. Detection Rate for Violation of Local Independence. Rows: l, W, UB, ζ_1, ζ_2, T_lag, T_1, T_2, MAE normal, MAE aberrant; columns: k = 30 and k = 60, each crossed with n = 400 and n = 1,000 and with p = 1/6, 1/3, and 1/2. Note. MAE = mean absolute error.]

[Table 8. False Alarm Rate for Violation of Local Independence. Rows: l, W, UB, ζ_1, ζ_2, T_lag, T_1, T_2; same column layout as Table 7.]

Discussion

Aberrant response behavior in psychological and educational testing may result in inadequate measurement for some persons. Therefore, misfitting item score patterns should be detected and removed from the sample. To classify an item score pattern as nonfitting, the researcher can simply take the top 1% or top 5% of aberrant cases. As an alternative, the researcher can use a theoretical sampling distribution or simulate data sets based on the item parameters estimated in the sample. Person-fit statistics can be used as descriptive statistics. In this study, however, person-fit statistics were used to test the hypothesis that an item score pattern is not in agreement with the underlying test model. The simulation methods applied in the literature thus far did not take into account the uncertainty about the item and person parameters of the IRT model. In this study, we used Bayesian methods that do take this uncertainty into account to classify an item score pattern as fitting or nonfitting. Although Bayesian methods are statistically superior to other simulation methods, a drawback is that they are relatively complex and computationally intensive.
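The logic of the posterior predictive check can be sketched compactly. Assuming the 3PL (in logistic form for simplicity) and using the log-likelihood of the score pattern as the discrepancy variable, each posterior draw of (θ, item parameters) yields one replicated pattern, and the p-value is the proportion of draws in which the replicated pattern fits worse than the observed one. Here the MCMC output is replaced by illustrative stand-in draws:

```python
import math
import random

def p3pl(theta, a, b, c):
    # 3PL response probability (logistic form, for illustration)
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_lik(y, theta, items):
    # Log-likelihood of a score pattern; used as the discrepancy variable.
    s = 0.0
    for resp, (a, b, c) in zip(y, items):
        p = p3pl(theta, a, b, c)
        s += math.log(p if resp else 1.0 - p)
    return s

def ppp_value(y, draws, rng):
    # Posterior predictive check: for each posterior draw, simulate a
    # replicated pattern y_rep and compare discrepancies; the p-value is
    # the fraction of draws with l(y_rep) <= l(y). Low values flag misfit.
    hits = 0
    for theta, items in draws:
        y_rep = [1 if rng.random() < p3pl(theta, a, b, c) else 0
                 for a, b, c in items]
        if log_lik(y_rep, theta, items) <= log_lik(y, theta, items):
            hits += 1
    return hits / len(draws)

rng = random.Random(11)
k = 30
base_items = [(1.0, -2.0 + 4.0 * i / (k - 1), 0.2) for i in range(k)]

# Stand-in for MCMC output: draws of (theta, item parameters) with some
# posterior spread around the "true" values.
draws = []
for _ in range(200):
    theta_d = 1.0 + rng.gauss(0.0, 0.3)
    items_d = [(a, b + rng.gauss(0.0, 0.1), c) for a, b, c in base_items]
    draws.append((theta_d, items_d))

aberrant = [0] * k  # all-incorrect pattern for a high-theta simulee
p_aberrant = ppp_value(aberrant, draws, rng)

normal = [1 if rng.random() < p3pl(1.0, a, b, c) else 0
          for a, b, c in base_items]
p_normal = ppp_value(normal, draws, rng)
```

Because the replication and the discrepancy are computed at each draw, the uncertainty about both θ and the item parameters propagates into the p-value, which is exactly the point of the Bayesian approach over plug-in simulation.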

Depending on the type of data and the problems envisaged, a researcher may choose a particular person-fit statistic, although not all statistics have equally favorable statistical properties. In general, sound statistical methods have been derived for the Rasch model, but because this model is rather restrictive for empirical data, the use of these statistics is restricted as well. For the 2PL and 3PL models, and for short tests and tests of moderate length (say, 10 to 60 items), the nominal Type I error rate under the standard normal distribution is, for most statistics, not in agreement with the empirical Type I error rate, due to the use of θ̂ rather than θ (van Krimpen-Stoop & Meijer, 1999). As an alternative, one may use the correction proposed by Snijders (2001), or the Bayesian simulation procedures discussed in this article. From the results, it can be concluded that even for a test as short as 30 items and for 400 simulees, the Type I error is well under control (approximately .03 at a nominal .05 for most statistics studied). It is particularly interesting to compare these results with the results obtained using the theoretical distribution. For example, Snijders (2001) found, for nominal α = .05, empirical α between .053 and .061 in a simulation study with 100,000 replications. He used the 2PL model, the log-likelihood statistic corrected for θ̂ (resulting in standard errors between .001 and .005), and a 15-item test. Note, however, that he considered the item parameters to be known. This is only realistic when the item parameters can be estimated very accurately, that is, for very large sample sizes. For small sample sizes, the method proposed in this study may be more suitable.

Detection rates differed across statistics and across the types of model violations simulated.
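The classical setup that Snijders (2001) addresses, known item parameters with evaluation at the true θ, can be mimicked in a small simulation using the standardized log-likelihood statistic l_z of Drasgow et al. (1985) under the 2PL; the item parameters below are arbitrary choices for illustration:

```python
import math
import random

def p2pl(theta, a, b):
    # 2PL response probability
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def lz(y, probs):
    # Standardized log-likelihood person-fit statistic l_z
    # (Drasgow, Levine, & Williams, 1985): (l0 - E[l0]) / sqrt(Var[l0]).
    l0 = sum(math.log(p if resp else 1.0 - p) for resp, p in zip(y, probs))
    mean = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    var = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (l0 - mean) / math.sqrt(var)

rng = random.Random(3)
k = 60
items = [(rng.uniform(0.8, 1.6), rng.gauss(0.0, 1.0)) for _ in range(k)]

# Empirical Type I error at nominal alpha = .05 (one-sided: a low l_z
# signals misfit), evaluated at the true theta and item parameters.
n_rep, rejections = 2000, 0
for _ in range(n_rep):
    theta = rng.gauss(0.0, 1.0)
    probs = [p2pl(theta, a, b) for a, b in items]
    y = [1 if rng.random() < p else 0 for p in probs]
    if lz(y, probs) < -1.645:  # 5% lower-tail critical value of N(0, 1)
        rejections += 1
alpha_hat = rejections / n_rep
```

With the true θ and known item parameters the empirical rate stays near the nominal level; replacing θ with θ̂ is what distorts it in practice, which is the gap both the Snijders correction and the Bayesian procedure aim to close.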
In general, it can be concluded that the detection rates for guessing and item disclosure were higher than those for violations of local independence. Note, however, that the MAE was relatively small in the latter case, in contrast to the MAE for the guessing condition. Also, for item disclosure, the MAE was often slightly larger for misfitting score patterns than for fitting score patterns, although the power of some person-fit statistics was high (Table 5). Aggregated over all conditions, the ζ_2 test had the highest power. The expectation that the UMP tests for person fit in the Rasch model (T_1, T_2, and T_lag) might also be superior for the 3PL model in a Bayesian framework was not corroborated; the traditional discrepancy tests did better here. Not reported above, but also included in the study, were versions of T_1, T_2, and T_lag in which the item scores were weighted by the discrimination parameters. The detection rates of these tests were consistently lower than those of the tests based on the unweighted scores. Interestingly, the detection rates decreased as the number of items affected by guessing increased. This is contrary to findings in earlier studies (Drasgow et al., 1985; Meijer & Sijtsma, 2001) and has a twofold explanation. First, in this study, a larger amount of guessing resulted in a substantial bias in the estimate of θ, in the sense that θ̂ was far lower than the original θ. As a result of using this lower θ̂, item score patterns with a decreased number of correct responses due to guessing were indexed as less aberrant than they would have been using the original θ. In this study, the bias was further amplified by the fact that guessing was imposed on the easy items.
Furthermore, in the extreme case where all the responses are guesses, θ̂ would become so low that the descriptive power of the item parameters for the responses would become negligible anyway, and a person-fit statistic would have no power at all. Second, this study differs from previous studies in that the data of the aberrant simulees were included in the calibration sample, and the uncertainty regarding the item parameters was taken into account in the computation of the person-fit statistics. In the extreme case where a vast majority of the respondents are aberrant, the bias in the parameter estimates becomes huge and, again, the person-fit statistics lose their power. In fact, in that case, the very definition of aberrancy is no longer obvious. A final remark concerns the generalization of the procedure presented here to a general IRT framework incorporating models with multiple raters, testlet structures, latent classes, and

multilevel structures (references given above). The common theme in these models is their complex dependency structure and the fact that they can be estimated using the Gibbs sampler. In all cases, the structure of the estimation procedure is analogous: Draws from the posterior distribution are made by partitioning the complete parameter vector into a number of components and sampling each component conditionally on the draws for the other components. Usually, the complete parameter vector is partitioned into the item parameters, the person parameters, augmented data (such as Z and W above), and hyperparameters, which may be related to restrictions on the parameters (as in testlet and other multilevel IRT models) and to some of the priors. In all these models, the statistics described above can be computed given the current draw of the item and person parameters, both for the observed data y and for replicated data y_rep drawn from the posterior predictive distribution.

References

Albert, J. H. (1992). Bayesian estimation of normal ogive item response functions using Gibbs sampling. Journal of Educational Statistics, 17.
Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38.
Baker, F. B. (1998). An investigation of item parameter recovery characteristics of a Gibbs sampling procedure. Applied Psychological Measurement, 22.
Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation of multidimensional IRT models. Psychometrika, 66.
Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Fox, J. P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman and Hall.
Gelman, A., Meng, X.-L., & Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6.
Glas, C. A. W. (1988). The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika, 53.
Glas, C. A. W. (1999). Modification indices for the 2PL and the nominal response model. Psychometrika, 64.
Haladyna, T. M. (1994). Developing and validating multiple-choice test items. Hillsdale, NJ: Lawrence Erlbaum.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications (2nd ed.). Boston: Kluwer-Nijhoff.
Hoijtink, H., & Molenaar, I. W. (1997). A multidimensional item response model: Constrained latent class analysis using the Gibbs sampler and posterior predictive checks. Psychometrika, 62.
Jannarone, R. J. (1986). Conjunctive item response theory kernels. Psychometrika, 51.
Janssen, R., Tuerlinckx, F., Meulders, M., & de Boeck, P. (2000). A hierarchical IRT model for criterion-referenced measurement. Journal of Educational and Behavioral Statistics, 25.
Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika, 49.
Klauer, K. C. (1991). An exact and optimal standardized person test for assessing consistency with the Rasch model. Psychometrika, 56.
Klauer, K. C. (1995). The assessment of person fit. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Kim, S.-H. (2001). An evaluation of a Markov chain Monte Carlo method for the Rasch model. Applied Psychological Measurement, 25.
Levine, M. V., & Drasgow, F. (1988). Optimal appropriateness measurement. Psychometrika, 53.
Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Meijer, R. R., & Nering, M. L. (1997). Trait level estimation for nonfitting response vectors. Applied Psychological Measurement, 21.
Meijer, R. R., & Sijtsma, K. (1995). Detection of aberrant item score patterns: A review and new developments. Applied Measurement in Education, 8.
Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25.
Meng, X. L. (1994). Posterior predictive p-values. Annals of Statistics, 22.
Mokken, R. J. (1971). A theory and procedure of scale analysis. Den Haag, The Netherlands: Mouton.
Molenaar, I. W. (1983). Some improved diagnostics for failure in the Rasch model. Psychometrika, 48.
Molenaar, I. W., & Hoijtink, H. (1990). The many null distributions of person fit indices. Psychometrika, 55.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24.
Patz, R. J., & Junker, B. W. (1997). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses (Technical Report No. 670). Pittsburgh, PA: Carnegie Mellon University, Department of Statistics.
Patz, R. J., & Junker, B. W. (1999). A straightforward approach to Markov chain Monte Carlo methods for item response theory models. Journal of Educational and Behavioral Statistics, 24.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research.
Reise, S. P. (1995). Scoring method and the detection of person misfit in a personality assessment context. Applied Psychological Measurement, 19.
Reise, S. P. (2000). Using multilevel logistic regression to evaluate person-fit in IRT models. Multivariate Behavioral Research, 35.
Reise, S. P., & Widaman, K. F. (1999). Assessing the fit of measurement models at the individual level: A comparison of item response theory and covariance structure models. Psychological Methods, 4.
Schmitt, N. S., Cortina, J. M., & Whitney, D. J. (1993). Appropriateness fit and criterion-related validity. Applied Psychological Measurement, 17.
Smith, R. M. (1985). A comparison of Rasch person analysis and robust estimators. Educational and Psychological Measurement, 45.
Smith, R. M. (1986). Person fit in the Rasch model. Educational and Psychological Measurement, 46.
Snijders, T. (2001). Asymptotic distribution of person-fit statistics with estimated person parameter. Psychometrika, 66.
Tatsuoka, K. K. (1984). Caution indices based on item response theory. Psychometrika, 49.
van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer-Verlag.
van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (1999). The null distribution of person-fit statistics for conventional and adaptive tests. Applied Psychological Measurement, 23.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analogue for the 3-PL useful in testlet-based adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice. Boston: Kluwer-Nijhoff.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press, University of Chicago.
Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (1996). BILOG-MG: Multiple-group IRT analysis and test maintenance for binary items. Chicago: Scientific Software International.


40.656 Lab 8 008 Lab 8: Introduction to WinBUGS Goals:. Introduce the concepts of Bayesian data analysis.. Learn the basic syntax of WinBUGS. 3. Learn the basics of using WinBUGS in a simple example. Next

Checklists and Examples for Registering Statistical Analyses

Checklists and Examples for Registering Statistical Analyses For well-designed confirmatory research, all analysis decisions that could affect the confirmatory results should be planned and registered

Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA

A THEORETICAL COMPARISON OF DATA MASKING TECHNIQUES FOR NUMERICAL MICRODATA Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University Agency Internal User Unmasked Result Subjects

Anomaly detection for Big Data, networks and cyber-security

Anomaly detection for Big Data, networks and cyber-security Patrick Rubin-Delanchy University of Bristol & Heilbronn Institute for Mathematical Research Joint work with Nick Heard (Imperial College London),

11. Time series and dynamic linear models

11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd

The primary goal of this thesis was to understand how the spatial dependence of

5 General discussion 5.1 Introduction The primary goal of this thesis was to understand how the spatial dependence of consumer attitudes can be modeled, what additional benefits the recovering of spatial

Bayesian Analysis for the Social Sciences

Bayesian Analysis for the Social Sciences Simon Jackman Stanford University http://jackman.stanford.edu/bass November 9, 2012 Simon Jackman (Stanford) Bayesian Analysis for the Social Sciences November

Lecture 6: The Bayesian Approach

Lecture 6: The Bayesian Approach What Did We Do Up to Now? We are given a model Log-linear model, Markov network, Bayesian network, etc. This model induces a distribution P(X) Learning: estimate a set

Dealing with Missing Data

Res. Lett. Inf. Math. Sci. (2002) 3, 153-160 Available online at http://www.massey.ac.nz/~wwiims/research/letters/ Dealing with Missing Data Judi Scheffer I.I.M.S. Quad A, Massey University, P.O. Box 102904

Bayesian Hidden Markov Models for Alcoholism Treatment Tria

Bayesian Hidden Markov Models for Alcoholism Treatment Trial Data May 12, 2008 Co-Authors Dylan Small, Statistics Department, UPenn Kevin Lynch, Treatment Research Center, Upenn Steve Maisto, Psychology

Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

Chapter 14 Managing Operational Risks with Bayesian Networks

Chapter 14 Managing Operational Risks with Bayesian Networks Carol Alexander This chapter introduces Bayesian belief and decision networks as quantitative management tools for operational risks. Bayesian

Gaussian Processes to Speed up Hamiltonian Monte Carlo

Gaussian Processes to Speed up Hamiltonian Monte Carlo Matthieu Lê Murray, Iain http://videolectures.net/mlss09uk_murray_mcmc/ Rasmussen, Carl Edward. "Gaussian processes to speed up hybrid Monte Carlo

Sample Size Designs to Assess Controls

Sample Size Designs to Assess Controls B. Ricky Rambharat, PhD, PStat Lead Statistician Office of the Comptroller of the Currency U.S. Department of the Treasury Washington, DC FCSM Research Conference

Parametric fractional imputation for missing data analysis

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 Biometrika (????,??,?, pp. 1 14 C???? Biometrika Trust Printed in

MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

Lecture #2 Overview. Basic IRT Concepts, Models, and Assumptions. Lecture #2 ICPSR Item Response Theory Workshop

Basic IRT Concepts, Models, and Assumptions Lecture #2 ICPSR Item Response Theory Workshop Lecture #2: 1of 64 Lecture #2 Overview Background of IRT and how it differs from CFA Creating a scale An introduction

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

A Latent Variable Approach to Validate Credit Rating Systems using R

A Latent Variable Approach to Validate Credit Rating Systems using R Chicago, April 24, 2009 Bettina Grün a, Paul Hofmarcher a, Kurt Hornik a, Christoph Leitner a, Stefan Pichler a a WU Wien Grün/Hofmarcher/Hornik/Leitner/Pichler

Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

Factor analysis. Angela Montanari

Factor analysis Angela Montanari 1 Introduction Factor analysis is a statistical model that allows to explain the correlations between a large number of observed correlated variables through a small number

Basic Bayesian Methods

6 Basic Bayesian Methods Mark E. Glickman and David A. van Dyk Summary In this chapter, we introduce the basics of Bayesian data analysis. The key ingredients to a Bayesian analysis are the likelihood

PS 271B: Quantitative Methods II. Lecture Notes

PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.

Inference on Phase-type Models via MCMC

Inference on Phase-type Models via MCMC with application to networks of repairable redundant systems Louis JM Aslett and Simon P Wilson Trinity College Dublin 28 th June 202 Toy Example : Redundant Repairable

Generating Valid 4 4 Correlation Matrices

Applied Mathematics E-Notes, 7(2007), 53-59 c ISSN 1607-2510 Available free at mirror sites of http://www.math.nthu.edu.tw/ amen/ Generating Valid 4 4 Correlation Matrices Mark Budden, Paul Hadavas, Lorrie

Summary of Probability

Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible