Using SAS PROC MCMC to Estimate and Evaluate Item Response Theory Models

Clement A Stone, University of Pittsburgh; email: cas@pitt.edu

Abstract  Interest in estimating item response theory (IRT) models using Bayesian methods has grown tremendously in recent years. In part this is due to the appeal of the Bayesian paradigm among psychometricians and statisticians, as well as the advantages of these methods with small sample sizes and for more complex or highly parameterized models (e.g., multidimensional IRT models). Recently, routines have become available in SAS to implement general Bayesian estimation methods (PROC MCMC). The purpose of this paper is to illustrate the use of PROC MCMC to estimate and evaluate the fit of IRT models, as well as to compare its usage with WinBUGS. Examples will be presented that illustrate the syntax used to estimate the models (simple and more complex IRT models), and output will be presented to illustrate the type of output that is available to evaluate the convergence of the MCMC algorithm. Finally, tools for comparing competing models will be discussed, as well as use of the posterior predictive distribution to evaluate model fit.

1 Item Response Theory Models

Psychometricians (c.f., Hambleton and Swaminathan, 1985; Embretson and Reise, 2000) have discussed the advantages of applying item response theory (IRT), as opposed to classical test theory (CTT), to scale development. IRT consists of a family of models that can be used to predict performance on each item, or the probability of choosing a response category, using characteristics of items and persons. More specifically, IRT models consist of one or more latent ability/trait parameters (person characteristics) and different item parameters (Embretson and Reise, 2000; Hambleton and Swaminathan, 1985). These parameters, in conjunction with a mathematical function (normal ogive or logistic), may be used to model the probability of a response (e.g., correct/incorrect, or endorsing a particular response option):

P_j(U_j = u_j \mid \omega_j, \theta)

where \theta is a potentially multidimensional vector of person characteristics (e.g., abilities or traits being measured) and \omega_j is a vector of item characteristics included in the item response function for the IRT model.
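As a brief illustration of this general function, one standard parameterization for dichotomously scored items is the two-parameter logistic (2PL) model:

P_j(U_j = 1 \mid \theta) = \frac{1}{1 + \exp[-a_j(\theta - b_j)]}

where b_j is the difficulty of item j and a_j is its discrimination. Constraining all items to share a common discrimination a yields the 1-Parameter model used in the PROC MCMC template in Section 3.1.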
Different IRT models can be applied to assessments that vary in terms of the scoring options as well as the number of traits being measured. For dichotomously scored items, standard IRT models such as the Rasch or 1-Parameter, 2-Parameter, and 3-Parameter IRT models may be used (Hambleton and Swaminathan, 1985). For polytomously scored items, such as items with ordinal rating scales, the graded response (GR) IRT model (Samejima, 1969) or polytomous extensions of Rasch or 1-Parameter IRT models (Muraki, 1992) may be used. Multidimensional extensions of these IRT models (compensatory and bi-factor models) incorporate additional person parameters to reflect the multiple traits being measured, and additional item parameters to reflect the extent to which each item is related to each dimension (c.f., Muraki and Carlson, 1995). Also, when local dependence (LD) exists among a set of items, the dependency can be modelled within a modified IRT model (e.g., Wang, Bradlow, and Wainer, 2002). These latter models are more highly parameterized, which complicates the estimation process and may demand alternative methods for estimating parameters.

The purpose of this paper is to illustrate Bayesian estimation of a subset of these models using SAS PROC MCMC. The models illustrated include unidimensional and multidimensional IRT models for dichotomously scored items. It should be noted that use of SAS PROC MCMC is not limited to these models. As will be discussed, all that is necessary to estimate other models is to change the model or likelihood equations that are defined to reflect the desired parameterizations.

2 Bayesian Estimation of IRT Models

In contrast to traditional statistics, a Bayesian paradigm considers model parameters to be random variables and is used to obtain distributions for model parameters from which direct probability statements about parameters may be made. Bayesian analysis begins with an assessment of the uncertainty in a parameter (prior distribution) before data are collected. Using Bayes theorem, a distribution for the parameter given the data (posterior distribution) may be obtained by updating beliefs about how parameters are distributed (prior distribution) with the distribution of the data given the parameters (likelihood function):

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

where p(D \mid \theta) is the likelihood function (the probability of the data, D, given possible values of \theta), p(\theta) is the prior distribution for the parameter \theta, reflecting prior belief or uncertainty about \theta, and p(D) is the marginal or unconditional probability of the data across all possible values of \theta, also described as the probability of the data being generated by the statistical model. See Fox (2010) for a discussion of applications to IRT models.
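Applied to an IRT model, the parameter vector contains both person and item parameters, so Bayes theorem yields a joint posterior over all of them. As a sketch of its general form, assuming independent priors and the usual local independence assumption, for N persons and J items:

p(\theta, \omega \mid U) \propto \prod_{i=1}^{N} \prod_{j=1}^{J} P_j(U_{ij} = u_{ij} \mid \omega_j, \theta_i) \; \prod_{i=1}^{N} p(\theta_i) \; \prod_{j=1}^{J} p(\omega_j)

This is the distribution that the likelihood, prior, and random-effect specifications in Section 3.1 jointly define.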
In addition to the useful interpretations afforded by a Bayesian paradigm, other advantages to using Bayesian methods to estimate IRT models include the following:

- Parameter estimates are less variable than maximum likelihood (ML) estimates, and are generally closer to true values in small samples and with short tests.
- The methods accommodate perfect and imperfect response patterns.
- Ability or trait parameter estimates are available simultaneously with item parameter estimates; therefore, uncertainty in item parameter estimates is incorporated directly into ability or trait parameter estimates.
- The estimation process is more stable; model parameter estimates do not drift out of bounds.
- The methods can be used readily with complex or highly parameterized models (e.g., models with parameters that reflect additional facets of an assessment).

For most models, Bayesian analysis and computing posterior distributions involve complex integrals for marginal probability distributions. As the number of model parameters increases, the set of integrals defined by p(D) becomes numerically intractable.
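In symbols, the intractable quantity is the normalizing constant from Bayes theorem; written in the notation above:

p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta

where, for an IRT model, the integral is taken over all person and item parameters, so its dimension grows with both the number of examinees and the number of items.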
To solve this problem, MCMC (Markov chain Monte Carlo) methods may be used to generate a sample of parameter values from the joint posterior distribution of the parameters. Thus, simulations are used to replace the complex mathematical calculations, and the availability of increased computing power has made Bayesian analysis via MCMC much more accessible. See, for example, Fox (2010) for a discussion of MCMC algorithms.

3 Using SAS PROC MCMC to Estimate IRT Models

Bayesian analysis by MCMC methods has been greatly simplified through the WinBUGS computer program (Spiegelhalter, Thomas, and Best, 2000). Recently, routines have also become available in SAS to implement general Bayesian estimation methods (PROC MCMC). SAS has several significant advantages over WinBUGS: 1) it is commonly used by researchers across disciplines; 2) it provides a robust programming language that extends the capability of the program; and 3) it offers improved performance and efficiency.

As in WinBUGS, specification of the model is straightforward. All that is necessary is to specify the model or likelihood, the model parameters, and the prior distributions for the model parameters. Information about prior distributions for parameters may be available from previous assessments, from the purpose of the assessment, or from the known properties of IRT response functions. For example, specification of prior distributions can consider whether an assessment serves as a broad measure of achievement or is instead designed to measure the trait or ability in a local region of the trait or ability continuum. Often, however, vague or uninformative priors are defined to mitigate the influence of the priors on the estimation process. For these types of priors, the probability distributions are relatively flat across the range of possible values for the parameter.

3.1 SAS PROC MCMC Template: 1P IRT Model

As a template for using PROC MCMC, the commands to estimate a 1-Parameter IRT model for 5 dichotomously scored items are as follows:

PROC MCMC data=<item response dataset>
          outpost=<SAS dataset for posterior distribution results>;
   array b[5]; array x[5]; array p[5];  * specify arrays for variables;
   parms a 1 b: 0;                      * specify initial values for item parameters;
   prior b: ~ normal(0, var=value);     * define prior distributions for item parameters;
   prior a ~ lognormal(0, var=value);
   * specify prior distribution for trait parameters (random effects in the model);
   random theta ~ normal(0, var=1) subject=id;
   * specify likelihood for IRT model;
   do j=1 to 5;
      p[j] = 1/(1+exp(-a*(theta - b[j])));
   end;
   model x1 ~ binary(p1);
   model x2 ~ binary(p2);
   model x3 ~ binary(p3);
   model x4 ~ binary(p4);
   model x5 ~ binary(p5);
RUN;

Note that normal and lognormal priors are typically specified for the b and a parameters, respectively. Also note that missing values are accommodated in PROC MCMC; any missing responses are treated as random variables and sampled directly using the response model.

Alternatively, the above likelihood expression can be specified more generally as:

   llike = 0;
   do j=1 to 5;
      prob = 1/(1+exp(-a*(theta - b[j])));
      llike = llike + x[j]*log(prob) + (1 - x[j])*log(1 - prob);
   end;
   model general(llike);

Such a specification allows the likelihood for any IRT model to be specified. In particular, since no built-in MODEL statement distribution is available to accommodate non-binary item responses, this specification is necessary for polytomously scored items.

Since the MCMC method uses simulation procedures, it is important to determine whether the Markov chain has converged to the target posterior distribution. If the chain has not converged, the simulated draws from the chain will not represent the posterior distribution of interest. It is therefore very important to assess the convergence of Markov chains before making any Bayesian inferences. As is the case with WinBUGS, numerous diagnostic tools are available to determine the extent to which a chain has converged. One popular tool is the trace plot of sampled parameter values versus the iteration number in the chain: a chain that mixes well traverses the posterior space rapidly and reaches all regions of the relevant parameter space. In addition to trace plots, other commonly used convergence statistics are available.

Also available are a number of statistics for analysing the posterior distribution. These include the mean, standard deviation, percentiles, Bayesian equal-tail credible interval, highest posterior density (HPD) interval, and the Deviance Information Criterion (DIC). DIC (Spiegelhalter et al., 2002) is a model comparison tool similar to the well-known Akaike information criterion (AIC) and the Bayesian information criterion (BIC). DIC can be applied to nonnested models and to models with non-iid data. Smaller DIC values indicate a better fit to the data.
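As an illustration of how these diagnostics and the DIC might be requested, the PROC MCMC statement in the template above could be extended with the sampling and output options sketched below. The specific option values shown are illustrative assumptions only, not recommendations:

PROC MCMC data=<item response dataset>
          outpost=<SAS dataset for posterior distribution results>
          seed=1234                        /* fix the random seed for reproducibility */
          nbi=5000                         /* burn-in iterations to discard */
          nmc=20000                        /* iterations retained for the posterior sample */
          thin=2                           /* keep every 2nd draw to reduce autocorrelation */
          plots=(trace autocorr density)   /* trace, autocorrelation, and density plots */
          diagnostics=(geweke ess)         /* Geweke test and effective sample sizes */
          dic;                             /* request the Deviance Information Criterion */
   /* ...arrays, parms, priors, random effects, and likelihood as in the template... */
RUN;

With PLOTS= and DIAGNOSTICS= specified this way, SAS produces the trace plots and summary convergence diagnostics described above for each monitored parameter, and the DIC is reported alongside the posterior summaries.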
4 Evaluating the Fit of IRT Models

While other software is available that simplifies Bayesian estimation of IRT models (e.g., WinBUGS), it is equally important to evaluate the fit of a particular IRT model to item responses. When a model does not fit the data, the validity of any inferences based on the model parameter estimates is threatened. SAS PROC MCMC affords advantages over other available software for Bayesian estimation of IRT models. For evaluating model-data fit, in particular, SAS provides a robust programming language and built-in statistical/graphical tools, all of which extend the capability of the program beyond estimating the model to computing many different types of statistics for comparing competing models and evaluating model fit.

The Posterior Predictive Model Checking (PPMC) method is a commonly used Bayesian model checking tool and has proved useful with IRT models (e.g., Levy, Mislevy, and Sinharay, 2009; Sinharay, 2006; Sinharay, Johnson, and Stern, 2006; Zhu and Stone, 2011). PPMC involves simulating data under the assumed or estimated model and comparing features of the simulated data against the observed data using discrepancy measures or statistics calculated on both (e.g., a residual or total score). Specifically, PPMC assesses the fit of a model by examining whether the observed data, y, appear extreme with respect to data replicated from the model (y^{rep}):

p(y^{rep} \mid y) = \int p(y^{rep}, \theta \mid y)\, d\theta = \int p(y^{rep} \mid \theta)\, p(\theta \mid y)\, d\theta

The rationale underlying PPMC is that, if a model fits the data, the observed data should be similar to replicated or simulated data based on the posterior distributions for the model parameters, p(\theta \mid y). In turn, distributions of discrepancy measures calculated on both simulated and observed data should also be similar if the model fits. As would be expected, choosing appropriate discrepancy measures is important when assessing model fit using PPMC. Measures should be chosen to reflect the potential sources of misfit most relevant to a particular application (Zhu and Stone, 2011; Zhu and Stone, in press). For example, when a unidimensional IRT model is estimated for item responses that may reflect a multidimensional structure, discrepancy measures should be chosen that are sensitive to this threat to model fit. A sketch of how PPMC might be implemented with the posterior sample from PROC MCMC is given below.
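The following is a minimal sketch, not a definitive implementation, of a PPMC comparison in SAS. It assumes a 1P model estimated as in the template for 5 items and N = 200 examinees; a posterior data set POST from OUTPOST= containing the item parameters a and b1-b5 and person parameters theta_1-theta_200; and an observed-data discrepancy value (here, the variance of examinee total scores) computed beforehand and stored in the macro variable OBS_T. All data set, variable, and macro names are hypothetical.

/* For each posterior draw, simulate replicated responses under the 1P    */
/* model and compute the discrepancy measure T = variance of total        */
/* scores; then estimate the posterior predictive p-value by comparing    */
/* T(y_rep) with the observed value T(y) stored in &obs_t.                */
data ppmc;
   set post;                            /* one record per posterior draw */
   array b[5] b1-b5;                    /* item difficulties */
   array th[200] theta_1-theta_200;     /* person parameters (random effects) */
   call streaminit(2023);               /* arbitrary seed for reproducibility */
   sumx  = 0;
   sumx2 = 0;
   do i = 1 to 200;                     /* loop over examinees */
      total = 0;
      do j = 1 to 5;                    /* simulate each item response */
         prob = 1/(1 + exp(-a*(th[i] - b[j])));
         total + rand('bernoulli', prob);
      end;
      sumx  + total;
      sumx2 + total**2;
   end;
   t_rep = (sumx2 - sumx**2/200)/199;   /* variance of replicated total scores */
   extreme = (t_rep >= &obs_t);         /* indicator: replicated value >= observed */
   keep t_rep extreme;
run;

/* The mean of EXTREME estimates the posterior predictive p-value; */
/* values near 0 or 1 signal misfit on this discrepancy measure.   */
proc means data=ppmc mean;
   var extreme;
run;

Other discrepancy measures (e.g., item-level proportions correct or residual-based statistics) can be substituted by changing only the statistic computed inside the DATA step.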
5 Presentation Format

Examples will be presented that illustrate the syntax used to estimate the models (simple and more complex IRT models), and output will be presented to illustrate the type of diagnostics that are available to evaluate the convergence of the MCMC algorithm. Specification of priors and how problems with convergence can be addressed will also be discussed. Finally, tools for comparing competing models will be discussed, as well as use of the posterior predictive distribution to evaluate model fit and the ease with which PPMC may be implemented in SAS PROC MCMC. Some of the discrepancy measures to be discussed will be based on Zhu and Stone's (2011) research on Bayesian estimation of IRT models for performance assessments.

References

Embretson, S. E., Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Fox, J. P. (2010). Bayesian item response modelling. New York, NY: Springer.
Hambleton, R. K., Swaminathan, H. (1985). Item response theory. Boston, MA: Kluwer-Nijhoff.
Levy, R., Mislevy, R. J., Sinharay, S. (2009). Posterior predictive model checking for multidimensionality in item response theory. Appl. Psychol. Meas., 33(7), 519-537.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Appl. Psychol. Meas., 16, 159-176.
Muraki, E., Carlson, J. E. (1995). Full-information factor analysis for polytomous item responses. Appl. Psychol. Meas., 19, 73-90.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17.
Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. Brit. J. Math. Stat. Psy., 59, 429-449.
Sinharay, S., Johnson, M. S., Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Appl. Psychol. Meas., 30(4), 298-321.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. Roy. Stat. Soc. B, 64, 583-639.
Spiegelhalter, D. J., Thomas, A., Best, N. (2000). WinBUGS v1.3. Cambridge, UK: MRC Biostatistics Unit.
Wang, X., Bradlow, E. T., Wainer, H. (2002). A general Bayesian model for testlets: Theory and applications. Appl. Psychol. Meas., 26(1), 109-128.
Zhu, X., Stone, C. A. (2011). Assessing fit of unidimensional graded response models using Bayesian methods. J. Educ. Meas., 48, 81-97.
Zhu, X., Stone, C. A. (in press). Bayesian comparison of alternative graded response models for performance assessment applications. Educ. Psychol. Meas.