Score Comparability of Online and Paper Administrations of the Texas Assessment of Knowledge and Skills

Walter D. Way
Laurie Laughlin Davis
Steven Fitzpatrick
Pearson Educational Measurement

Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA, April 2006
Introduction

A rapidly increasing number of state education departments are exploring or implementing online assessments as part of their statewide assessment programs. The potential advantages of online testing in K-12 settings are obvious. These include quicker turnaround of results, cost savings related to printing and shipping paper test materials, improved test security, more flexible and less burdensome test administrations, and a technological basis for introducing innovative item formats and test delivery algorithms. In addition, recent surveys indicate that students testing online enjoy their experiences, feel comfortable with taking tests by computer, and tend to prefer it to traditional paper testing (Glasnapp, Poggio, Poggio, & Yang, 2005; O'Malley et al., 2005; Ito & Sykes, 2004). In states where online testing has been introduced as part of their high-stakes assessments, not all schools have had the infrastructure and equipment to test online. For this reason, paper and online versions of the same tests are typically offered side-by-side. Any time paper-based and online assessments co-exist, professional testing standards indicate the need to ensure comparable results across the two mediums. The Guidelines for Computer-Based Tests and Interpretations (APA, 1986) states: "...when interpreting scores from the computerized versions of conventional tests, the equivalence of scores from computerized versions should be established and documented before using norms or cut scores obtained from conventional tests" (p. 18). The joint Standards for Educational and Psychological Testing also recommends empirical validation of score interpretations across computer-based and paper-based tests (AERA, APA, & NCME, 1999, Standard 4.10). The comparability of test scores based on online versus paper testing has been studied for more than 20 years.
Reviews of the comparability literature were reported by Mazzeo and Harvey (1988), who reported mixed results, and Drasgow (1993), who concluded that there were essentially no differences in examinee scores by mode of administration for power tests. Paek (2005) provided a summary of more recent comparability research and concluded that, in general, computer and paper versions of traditional multiple-choice tests are comparable across grades and academic subjects. However, when tests are timed, differential speededness can lead to mode effects. For example, a recent study by Ito and Sykes (2004) reported significantly lower performance on timed web-based norm-referenced tests at grades 4-12 compared with paper versions. These differences seemed to occur because students needed more time on the web-based test than they did on the paper test. Pommerich (2004) reported evidence of mode differences due to differential speededness in tests given at grades 11 and 12, but in her study online performance on questions near the end of several tests was higher than paper performance on the same items. She hypothesized that students who are rushed for time might actually benefit from testing online because the computer makes it easier to respond and move quickly from item to item. A number of studies have suggested that no mode differences can be expected when individual test items can be presented within a single screen (Poggio, Glasnapp, Yang, & Poggio, 2005; Hetter, Segall, & Bloxom, 1997; Bergstrom, 1992; Spray, Ackerman, Reckase, & Carlson, 1989). However, when items are associated with text that requires scrolling, as is typically the case with reading tests, studies have indicated lower performance for students testing online (O'Malley et al., 2005; Pommerich, 2004; Bridgeman, Lennon, & Jackenthal, 2003; Choi & Tinkler, 2002; Bergstrom, 1992). In general, the results of comparability research are difficult to evaluate for several reasons. First, there has been a continual evolution in both computer technology and the computer skills of test-takers. Thus, earlier studies have limited generalizability, and more recent studies may not generalize well to future settings. Second, most comparability research is carried out in the context of operational testing programs, where less-than-desirable experimental control is usually the norm.
In such studies, conclusions are often tempered because of design limitations such as lack of random assignment, insufficient statistical power, order-of-administration effects, and effects due to differences in test forms given across modes. Finally, the content areas, test designs, test administration systems, and testing populations can differ considerably across comparability studies, and differences in any of these factors could lead to different findings from one study to another. For a policy maker interested in introducing online assessments for a high-stakes K-12 testing program, the need to assess comparability creates a number of challenges. While some stakeholders will lobby for immediate and widespread introduction of online testing, researchers and psychometricians will advise more cautious and controlled experimental studies. Such studies can be expensive and usually require efforts beyond those needed to meet the usual challenges associated with the ongoing paper-based program. Furthermore, no matter how well a comparability study is designed, executing the design depends on the volunteer participation of individual schools and districts. As such, one can expect that schools will vary in their ability to execute the procedures called for in the experimental design, and that a nontrivial number of schools signed up for the study will invariably drop out. Poggio et al. (2005) and Poggio, Glasnapp, Yang, Beauchamp, and Dunham (2005) reported on an approach to comparability research in the live context of the Kansas assessment program that balanced an aggressive approach to online implementation with the need to collect comparability data. In their studies, all schools were invited to administer the Kansas Computerized Assessment (KCA), and online volunteers were further asked if they would be willing to double-test their students by administering a paper form of the test in addition to the online assessment. Studies were carried out for grade 7 mathematics in spring 2003 and, in a later administration, for mathematics (grades 4, 7, and 10) and reading (grades 5, 8, and 11). The studies reported no evidence of mode effects for any of the tests evaluated. However, some of the findings may have been confounded by order-of-administration effects and limited samples of students for whom testing order could be reliably identified. If a mode effect for reading did exist, it is not clear whether the design carried out could have identified it, and if so, whether a sufficient statistical adjustment could have been applied. Because only a subset of students taking the KCA also took the paper test, it would not have been possible to assign each online student the higher of two scores.
In this paper, we present results from two online comparability studies that were conducted for the Texas statewide assessment program in spring 2005. The purpose of the studies was to evaluate the comparability of online and paper versions of the Texas Assessment of Knowledge and Skills (TAKS) in mathematics, reading/English language arts, science, and social studies at grades 8 and 11 for the purposes of test score reporting, and to appropriately adjust equated score conversion tables for students testing online as warranted. In the sections that follow, we will describe the TAKS program and initial efforts to transition the program to online testing, introduce the design and methodology used for the comparability studies at each grade level, and present results of the score comparability studies conducted at grades 8 and 11.
In particular, we will introduce an approach and design for studying the comparability of online and paper tests that we refer to as matched samples comparability analyses (MSCA). We believe this approach is particularly well-suited to monitoring comparability as states transition their high-stakes testing programs to online testing. In the last section of this paper, we will report on some additional analyses that evaluate the sensitivity of the MSCA approach for detecting differences in online and paper group performance when these groups differ in terms of overall proficiency.

The TAKS Program and Online Testing

TAKS is the primary state-mandated assessment in Texas, and represents the latest and most comprehensive implementation of statewide assessments in Texas, which have been ongoing for more than 20 years. First administered in spring 2003, TAKS is given to students in mathematics at grades 3-10 and at the exit level (grade 11); in reading at grades 3-9; in writing at grades 4 and 7; in English language arts (ELA) at grade 10 and at the exit level; in science at grades 5, 8, and 10 and at the exit level; and in social studies at grades 8 and 10 and at the exit level. Spanish versions of TAKS are available at grades 3-6. Every TAKS test is directly aligned to the Texas Essential Knowledge and Skills (TEKS) curriculum. On each TAKS test, the critical knowledge and skills are measured by a series of test objectives. These objectives are not found verbatim in the TEKS curriculum. Rather, the objectives are umbrella statements that serve as headings under which student expectations from the TEKS can be meaningfully grouped. TAKS test results are used to comply with the requirements of the No Child Left Behind (NCLB) act, as well as for statewide accountability purposes. The exit level TAKS is part of high school graduation requirements in Texas and is offered multiple times to students who do not pass.
Test results are reported to teachers and parents, and are used for instructional decisions as appropriate. The TAKS tests are scaled separately at each grade, with a score of 2100 representing "met standard" and 2400 representing "commended performance" at each grade level. In practice, the highest equated scale score below each of these thresholds is set to the threshold value. Additional information on the TAKS can be found on the Texas Education Agency (TEA) web site.

The TEA first began testing by computer in fall 2002, when an end-of-course examination in Algebra I was made available online and districts were given the option of using this test in either online or paper format. In spring 2004, an online testing pilot was carried out in three grade 8 TAKS subject areas: reading, mathematics, and social studies. The goals of the pilot were to determine the administrative procedures necessary to deliver online assessments in the schools, to assess the readiness of Texas school districts to administer online assessments, to document administrative challenges, and, to the extent possible, to compare performance on online assessments with paper test performance. The pilot tests were administered in volunteering campuses during a two-week window prior to the operational grade 8 TAKS administration. Although data related to online performance were collected, the design of the pilot did not permit conclusive comparisons of online and paper performance. In spring 2005, the TEA carried out additional studies of online testing at grades 8 and 11 to compare online and paper test performance in reading, mathematics, social studies, and science. Score comparability for science was assessed only at grade 11, although a science field test at grade 8 included an online component. The grade 8 and 11 studies involved different data collection designs. At grade 8, schools that volunteered to participate were randomly assigned to administer one of the three TAKS content areas online. The same test form was administered both on paper and online. Each student tested only one time in a given content area; thus, the results for students testing online were to be reported as part of the statewide assessment results. At grade 11 (exit level), a special re-test administration was offered in June. Students in the participating schools who had not yet passed exit-level TAKS in at least one of the four subject areas were offered an extra testing opportunity as part of this administration.
In addition, a small number of students who would be entering grade 11 in the fall were allowed to participate in the administration (these students will be referred to as "rising juniors"). For each exit-level TAKS subject area, volunteering students in these schools were randomly assigned to take either an online or a paper version of the same test form.

Research Methodology

The comparability study design required conducting analyses that would support score adjustments for those students testing online, if such adjustments were warranted. To accomplish this, we utilized an approach that considered score comparability in the context of test equating. Specifically, we equated the online version of the tests to the paper version of the tests under the assumptions of a random groups design. The details of how the equatings were accomplished differed for grade 8 and grade 11, as described below.

Matched Samples Comparability Analyses for Grade 8

For grade 8, we initially thought that the comparability data could be analyzed based on random assignment to condition at the school level, as it was expected that approximately 40 schools would administer each of the three content areas online. However, voluntary participation in the comparability study was much lower than expected, and the number of schools testing in each subject area was too small to support analyses based on random assignment at the school level. As a result, we compared test performance for students testing online with comparison groups, drawn from the paper results, that were matched to the online students in terms of spring 2004 test performance. We refer to this approach as matched samples comparability analyses (MSCA). In this approach, student scale scores for reading and mathematics obtained in grade 7 were used as matching variables, and sub-samples of students equal in number to the students testing online were selected from the paper TAKS tests. The paper students were selected so that the distributions of grade 7 reading and mathematics scores in the online and matched paper groups were identical. In devising this approach, we first regressed 2004 grade 8 TAKS scale scores on 2003 grade 7 TAKS scale scores.
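The exact-matching step just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the record layout (dictionaries with hypothetical g7_reading and g7_math keys) is an assumption.

```python
import random
from collections import defaultdict

def draw_matched_paper_sample(online_students, paper_students, seed=0):
    """For each online student, draw a paper student whose grade 7
    reading and mathematics scale scores are identical to the online
    student's (the MSCA matching step). Field names are illustrative."""
    rng = random.Random(seed)
    # Index the paper students by their (reading, math) prior-score pair.
    strata = defaultdict(list)
    for s in paper_students:
        strata[(s["g7_reading"], s["g7_math"])].append(s)
    # Pick one paper student from the matching stratum per online student.
    return [rng.choice(strata[(s["g7_reading"], s["g7_math"])])
            for s in online_students]
```

Sampling within each stratum with replacement, as here, is also what makes the stratified bootstrap in the replication step straightforward.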
We found the following multiple correlations across reading, mathematics, and social studies (note that there is no grade 7 social studies test):

  Dependent Variable   Independent Variable(s)    r
  G8ReadingSS          G7ReadingSS                0.74
  G8ReadingSS          G7ReadingSS, G7MathSS      0.76
  G8MathSS             G7MathSS                   0.82
  G8MathSS             G7ReadingSS, G7MathSS      0.83
  G8SocSS              G7ReadingSS, G7MathSS      0.72

The MSCA involved a bootstrap method that was designed to establish raw-to-scale score conversions by equating the online form to the paper form, and also to estimate bootstrap standard errors of the equating to assist in interpreting differences between the online and paper score conversions (cf. Kolen & Brennan, 2004). The application of equating methods was based on the assumption that the online and matched paper sample groups were randomly equivalent. For each replication, we used IRT true score equating based on Rasch calibrations of the online and paper samples using the WINSTEPS program (Linacre, 2001). The MSCA involved sampling with replacement, in which both online and matched paper student samples were drawn 500 times and analyses were repeated for each replicated sample. The specific procedures used in the MSCA were as follows:

1. Each student testing online with grade 7 TAKS scores in reading and mathematics was matched to a student from the available 2005 paper TAKS data with identical grade 7 reading and mathematics scale scores. Both reading and mathematics were used in the matching for all three grade 8 subject areas.

2. Online versus paper comparability analyses were performed using the matched groups of students by repeating the following steps 500 times:

   a. A bootstrap sample of students (i.e., random sampling with replacement) was drawn from the online participants.

   b. A matched stratified bootstrap sample (i.e., random sampling with replacement at each combination of mathematics and reading scores observed in the online sample drawn in step 2.a) was drawn from the available 2005 paper TAKS data.

   c. A raw score-to-raw score equating was carried out with each pair of bootstrap samples as follows:

      i. WINSTEPS was used to calibrate the online group data, centering the calibrations so that the mean of the ability estimates was zero. The item difficulty estimates and raw score-to-theta conversions were retained.

      ii. WINSTEPS was used to calibrate the paper comparison group data, centering the calibrations so that the mean of the ability estimates was zero. The item difficulty estimates and raw score-to-theta conversions were retained.

      iii. IRT true score equating was used to find the paper comparison group raw score equivalent for each integer raw score in the online group by calculating ΣP(θ), where the summation is over the paper item difficulty estimates and θ is taken from the raw score-to-theta conversions for the integer online raw score found in step 2.c.i.

   d. Using linear interpolation and the unrounded operational 2005 raw score-to-scale score conversion tables, the paper raw score equivalents found in step 2.c.iii were converted to scale score equivalents.

3. The online scale score conversion for each raw score was based on the average of the conversions calculated over the 500 replications. These average scale score values comprised the alternate online raw score-to-scale score conversion table.

4. The standard deviation of the online scale score conversions at each raw score represented the conditional bootstrap standard error of the linking.

To assist in comparing the online and paper score conversions, we considered the following criterion suggested by Dorans and Lawrence (1990): "To assess equivalence, it is convenient to compute the difference between the equating function and the identity transformation, and to divide this difference by the standard error of equating. If the resultant ratio falls within a bandwidth of plus or minus two, then the equating function is deemed to be within sampling error of the identity function" (p. 247). It should be pointed out that the Dorans and Lawrence criterion is only one of many justifiable approaches that could be used to interpret the results. We also paid special attention to differences in the range of scale scores around the "met standard" score levels. Differences at the extremes of the scale were considered less important, given the purpose and primary uses of the TAKS tests.

Grade 11 Comparability Analyses

For the grade 11 comparability analyses, the researchers involved in the study randomly assigned the participating students from each school to the online or paper testing conditions. Because testing occurred over a single day for each subject area and many of the participating schools were limited in how many students they could test in a single day, slightly more students were assigned to the paper condition than to the online condition. To evaluate score comparability for the grade 11 study, we employed some of the same procedures used in the MSCA analyses for grade 8. Specifically, we randomly selected students from the online and paper samples with replacement 500 times and equated the scores obtained in each sampling replication. These bootstrap analyses resulted in alternate online score conversion tables for each test and bootstrap standard errors of equating to assist in interpreting results. One difference between the grade 11 and the grade 8 analyses was that the bootstrap replications involved simple random sampling with replacement; that is, there was no need to select a sample from the paper group that was matched to the online sample in terms of previous test scores.
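The core equating step (2.c.iii) and the bootstrap summary (steps 3 and 4) can be sketched as below. This is a simplified illustration under the Rasch model, not the WINSTEPS-based production code; the theta table and item difficulties are assumed inputs produced by the calibration steps.

```python
import math
import statistics

def rasch_true_score(theta, difficulties):
    """Expected raw score at theta: the sum over items of P_j(theta)
    under the Rasch model, P_j = 1 / (1 + exp(-(theta - b_j)))."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def true_score_equate(online_theta_table, paper_difficulties):
    """Map each integer online raw score to a paper raw-score equivalent
    by evaluating the paper test characteristic curve at the theta
    attained for that online raw score (step 2.c.iii)."""
    return {raw: rasch_true_score(theta, paper_difficulties)
            for raw, theta in online_theta_table.items()}

def summarize_replications(replicated_conversions):
    """Steps 3-4: average the equated value at each raw score over the
    bootstrap replications, and take its standard deviation as the
    conditional bootstrap standard error of the linking."""
    raw_scores = replicated_conversions[0].keys()
    mean = {r: statistics.mean(rep[r] for rep in replicated_conversions)
            for r in raw_scores}
    se = {r: statistics.stdev(rep[r] for rep in replicated_conversions)
          for r in raw_scores}
    return mean, se
```

The Dorans and Lawrence criterion then amounts to flagging raw scores where |equated value minus identity| exceeds twice the bootstrap standard error from `summarize_replications`.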
Another difference was that the bootstrap analyses for grade 11 ELA incorporated polytomously scored constructed-response and extended essay item types.

Results

Matched Samples Comparability Analyses for Grade 8

Table 1 presents the means and standard deviations of the grade 8 raw scores and grade 7 scale scores for each test evaluated using the MSCA. It can be seen in Table 1 that the mean raw scores on the grade 8 tests for the online and paper groups are similar (within 0.16) for all three tests. The grade 7 reading and mathematics scale scores used with the MSCA were very similar for the mathematics and social studies online and paper samples (within 7 points). However, for reading, the previous scale scores were noticeably higher for the online group than for the paper group (the mean reading scale score was about 18 points higher and the mean mathematics scale score was about 12 points higher).

Insert Table 1 about here

Tables 2 to 4 summarize the comparability analysis results for mathematics, reading, and social studies. The columns of the tables are as follows:

RS - Paper test raw score.
CBT_RS - Equivalent raw score on the online test based on the MSCA equating. Note that a higher equivalent raw score indicates that the online version of the test was more difficult.
RS_SD - Standard deviation of the equivalent raw scores over the 500 replications.
PAP_SS - Paper test scale score conversion, based on the 2005 TAKS equating results.
CBT_SS - Equivalent scale score on the online test based on the MSCA equating. Again, a higher equivalent scale score indicates that the online version of the test was more difficult.
SS_SD - Standard deviation of the equivalent scale scores over the 500 replications.
RS_DIF - Difference between the online raw score equivalent and the paper raw score.
SS_DIF - Difference between the online scale score equivalent and the paper scale score.
SIG? - Scale score differences exceeding two bootstrap standard errors are noted by **.

Insert Tables 2 to 4 about here

In these tables, the equating conversions for the online and paper forms are assumed to be the same for zero and perfect scores, since true score equating conversions cannot be estimated with the Rasch model at these score points. For mathematics (Table 2), the online versus paper differences were slight. In terms of the raw score conversions, the differences were never as much as one-half of a point.
In terms of scale score conversions, the differences were less than five points over most of the scale. However, at the upper raw score points (41 and higher), scale score differences exceeded two standard errors of the linking. For reading (Table 3), large differences occurred throughout the scale. Differences in raw score conversions exceeded one and a half points over much of the score range. Differences in scale score conversions were over 20 points over most of the score range. All of the differences in scale score conversions exceeded two standard errors of the linking. For social studies (Table 4), slight differences in both raw score and scale score conversions occurred. The raw score differences were never as much as one-half of a point, and the scale score differences were never as much as six points. None of the scale score differences exceeded two standard errors of the linking. Figure 1 presents the differences between the online and paper scale score conversions graphically as a function of raw score, along with upper and lower intervals defined by plus and minus two bootstrap standard errors of equating. These graphs provide a relatively concise summary of the patterns found in Tables 2 through 4.

Insert Figure 1 about here

Grade 11 Comparability Results

Table 5 presents univariate statistics for the online and paper groups participating in the grade 11 comparability study. These data indicate higher raw scores for the paper group than for the online group in mathematics, science, and ELA. For social studies, the raw score mean is slightly higher for the online group than for the paper group.

Insert Table 5 about here

Unlike the other exit level TAKS tests, which consist entirely of objectively scored items (i.e., multiple-choice and, for mathematics, a small number of grid-in response items), ELA contains both short-answer open-ended items and an extended essay item. Table 6 presents univariate statistics for the online and paper groups on the ELA test broken down by item type. The first entry in the table (ELA) is the unweighted sum of the items and has a possible maximum of 61 points (48 points for multiple-choice, 9 points for short answer, and 4 points for the essay). The second entry is weighted (ELA_WT), with the essay counting four times its maximum point total. The maximum possible weighted score is 73.
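The weighting arithmetic can be made concrete with a small helper; this is our reconstruction of the scoring arithmetic described above, not official TAKS scoring code.

```python
def ela_weighted_raw(mc, short_answer, essay):
    """Weighted ELA raw score: multiple-choice (max 48) and short-answer
    (max 9) points count at face value, while the essay (rubric max 4)
    counts four times its score, so the weighted maximum is
    48 + 9 + 4 * 4 = 73 (versus an unweighted maximum of 61)."""
    assert 0 <= mc <= 48 and 0 <= short_answer <= 9 and 0 <= essay <= 4
    return mc + short_answer + 4 * essay
```

For example, a student with 30 multiple-choice points, 5 short-answer points, and an essay score of 3 receives a weighted raw score of 30 + 5 + 12 = 47.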
The scale score for the ELA test is based on the weighted raw score. For the multiple-choice raw score, the essay score, and two of the three open-ended scores, the mean for the paper group was slightly higher than the mean for the online group.

Insert Table 6 about here

For many of the students participating in the grade 11 study, a previous TAKS score was available. These scores are listed in Table 7, which includes entries both for students whose previously available TAKS score was from the grade 10 test (mostly the "rising juniors") and for students whose previously available TAKS score was from the exit level test.

Insert Table 7 about here

Note that previous TAKS score means were substantially higher for students whose previous score was from the grade 10 test than for students whose previously available scale score was from the exit level test.(1) This was expected, given that the rising juniors participating in the study were generally higher achieving students, as compared with grade 11 students who had previously failed the exit level TAKS. For all four tests, the previous TAKS score means were very similar (within three scale score points) for online and paper students with a previous score from the exit level test. For mathematics, science, and social studies, students testing online had higher previous TAKS score means from grade 10 than students testing by paper. Although these scale score differences were comparatively large (16.2, 18.6, and 12.2, respectively), they seem less of a concern considering the small numbers of students and the much larger scale score standard deviations for the students with a previous TAKS score from the grade 10 test. Considering all of the information from the analysis of previous TAKS scores, we concluded that the assumption of randomly equivalent online and paper groups was reasonable for the grade 11 TAKS comparability analyses.
Figures 2 and 3 present graphs of the differences between the grade 11 online and paper scale scores as a function of raw score, along with upper and lower intervals defined by ±2 bootstrap standard errors of equating. Figure 2 presents results for mathematics and ELA, and Figure 3 presents results for science and social studies.

(1) Strictly speaking, the mean scale scores on the grade 10 and 11 tests are not directly comparable, since the scales are uniquely defined within each grade. However, since "met standard" and "commended performance" are defined at 2100 and 2400 for both grades, general comparative inferences seem reasonable.

Insert Figures 2 and 3 about here

The results for grade 11 mathematics presented in Figure 2 are very similar to the results for grade 8, except that both the online minus paper scale score differences and the intervals defined by ±2 bootstrap standard errors are larger than they were for grade 8 mathematics. For grade 11 mathematics, the online minus paper raw score differences were as high as 0.77 of a raw score point, and these differences occurred in the region of the "met standard" cut score. Thus, even though the scale score differences for grade 11 mathematics were within ±2 bootstrap standard errors, there was evidence of a greater mode effect for grade 11 mathematics than for grade 8 mathematics. The results for grade 11 ELA presented in Figure 2 also indicated that the test was more difficult for the online group than for the paper group, but as with grade 11 mathematics, the scale score differences were within ±2 bootstrap standard errors of equating over most of the score range. The online minus paper differences were never as large as one weighted raw score point, although the largest difference (0.95) occurred in the region of the "met standard" cut score. The differences at extremely high score levels (i.e., above a weighted raw score of 65) reflect the fact that ELA scale score conversions change by large amounts with changes in the weighted raw score in this region of the scale. For example, in the paper conversion table, a weighted raw score of 70 corresponded to a scale score of 2802, and the scale score for a weighted raw score of 71 was substantially higher still. Thus, an online minus paper raw score difference of less than a point in this region of the scale converted to a scale score difference of about 50 points.
In addition, the Rasch true score equating was not accurate in this region of the scale because of the limited number of high-performing students participating in the study and the relative difficulty of the open-ended constructed-response items.(2)

(2) In fact, to obtain Rasch scaling tables that extended over the entire range of possible scores, it was necessary to augment both the online and paper ELA samples with an imputed item response record that included maximum scores on the two open-ended items for which no student in either sample obtained the maximum possible score (see Table 6).
The results for grade 11 science and social studies presented in Figure 3 indicate little or no evidence of mode effects between the students testing online and the students testing on paper. The results for science indicated that the online version was slightly more difficult than the paper version, but the raw score differences were never as high as one-half of a raw score point. The scale score differences ranged between 3 and 7 points over most of the scale, and these differences never exceeded ±2 bootstrap standard errors. The social studies results indicated that the online form was slightly easier than the paper version. Raw score differences were never more than 0.40 of a raw score point, and scale score differences were six points or lower over most of the scale. The social studies differences never exceeded ±2 bootstrap standard errors, although the bootstrap standard errors of equating for social studies were much larger than for the other tests because only 355 online students and 388 paper students took the social studies test.

Summary of the Grade 8 and Grade 11 Comparability Study Results

To summarize the results of the grade 8 and grade 11 comparability studies, there was evidence across grades and content areas that the online versions of the TAKS tests were more difficult than the paper versions. In grade 8 reading, the mode differences were quite pronounced and warranted the use of the alternate score conversion table for reporting online results. In grade 11 mathematics and ELA, the differences were less pronounced, and the ELA results were also complicated by the contributions of constructed-response and extended essay items to the total scores. Nevertheless, the alternate score conversions were used for reporting scores on these tests, in part because of the magnitudes of the raw score differences but also because of the high stakes associated with these tests.
For the social studies tests, there was little evidence of mode effects across the two grades: differences slightly favored the paper group at grade 8 and slightly favored the online group at grade 11. The comparability results for grade 8 mathematics and grade 11 science also favored the paper groups, although the differences were slight and within ±2 bootstrap standard errors of equating for nearly all score points. In general, the results of the comparability analyses for the TAKS tests at grades 8 and 11 were consistent with the existing literature on the comparability of online and paper assessments, in that the tests where the most significant mode differences were detected involved reading passages that required scrolling. The mode differences in mathematics, although not large, were less consistent with the comparability literature, which mostly supports the comparability of online and paper mathematics tests. Keng, McClarty, and Davis (2006) further investigate the mode differences found for these measures through item-level analyses.

Sensitivity of the MSCA Approach

Although the MSCA method appeared to work well in the context of the grade 8 TAKS online tests, the conditions for the matched sample analyses were favorable in that the ability levels of the paper and online groups (based on previous test scores) were reasonably similar. In reviewing results of the analyses, technical advisors working with the state of Texas recommended that the performance of the MSCA method be studied further to see how sensitive it is under conditions where the online group and the paper group are less similar in overall ability. Such documentation is important given that the MSCA approach has been used to determine and potentially apply alternate score conversions for students taking operational TAKS tests online. To address this recommendation, additional sensitivity analyses of the MSCA method were carried out. The purpose of these analyses was to answer two specific questions:

1. How will the matched sample analyses perform when no mode differences exist but the online group and paper group differ in ability based on past test performance?

2. Will the matched sample analyses recover simulated mode differences when the online and paper groups differ in ability based on past test performance?

Sensitivity Analysis Procedures

The general approach for the sensitivity analyses was to select samples of students from the paper data used in the spring 2005 grade 8 comparability study and to carry out matched sample analyses as if these samples were students testing online. Analyses were conducted for mathematics and reading.
Four sets of analyses were undertaken. The first set utilized six mathematics data sets and six reading data sets drawn from the overall paper data for each measure. For each of the six data sets for a given test, a different target frequency distribution was established for sampling students. The variables used to sample the data were the previous spring's scale scores (mathematics scale scores for mathematics, reading scale scores for reading). The sample sizes were 1,275 for the mathematics data sets and 1,850 for the reading data sets, roughly equivalent to the numbers of spring 2005 grade 8 online testers in these subjects. Tables 8 and 9 list the scale score frequencies and score means of the six selected sensitivity samples and the overall paper group. For both mathematics and reading, performance increased from sample 1 to sample 6, and sample 4 was proportionally equivalent to the overall paper data.

Insert Tables 8 and 9 about here

The second, third, and fourth sets of analyses simulated mode differences between the online and paper groups. The data for these analyses were created by systematically modifying the data sets from the first set of analyses to lower performance on the 2005 test. Three conditions of lowered performance (0.25, 0.5, and 1.0 raw score points, respectively) were simulated for each data set from the first set of samples. To accomplish this, the responses to randomly selected items were changed from correct to incorrect for approximately one-half of the students. (Only one-half of the records were altered to ensure that some perfect and near-perfect scores remained in the data.) Because the process of changing responses from correct to incorrect was random, it was built into the bootstrap replications. The SAS code to accomplish this change is shown below:

compare = 1/rawold * &mode. * 2;
rchange = ranuni(-1);
if rchange > 0.5 then do;
  do i = 1 to &nitem.;
    if items{i} = 1 then do;
      if compare > ranuni(-1) then items{i} = 0;
    end;
  end;
end;

The variable rawold is the student's original raw score, and &mode is equal to 0.25, 0.5, or 1.0, depending upon the condition. The SAS function ranuni generates a uniform random variable between zero and one. The total number of conditions in the sensitivity analyses was 48 (2 content areas × 6 samples × 4 sets of analyses).
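The logic of the SAS step above can be mirrored in Python as follows. This is a sketch rather than the operational code: each record is altered with probability one-half, and within an altered record each correct response is flipped with probability 2 × mode_shift / rawold, so the expected raw score decrement over all records is about mode_shift points. The check at the end uses hypothetical records, not TAKS data.

```python
import random

def simulate_mode_effect(records, mode_shift, seed=None):
    """Lower raw scores by about mode_shift points on average by flipping
    randomly chosen correct responses to incorrect for roughly half of the
    records, mirroring the SAS step in the text."""
    rng = random.Random(seed)
    for items in records:  # items: list of 0/1 scored item responses
        raw_old = sum(items)
        if raw_old == 0 or rng.random() <= 0.5:
            continue  # leave zero scores and about half of the records untouched
        # Flip each correct response with probability 2 * mode_shift / raw_old,
        # so the decrement per altered record averages 2 * mode_shift.
        p_flip = min(1.0, 2.0 * mode_shift / raw_old)
        for i, resp in enumerate(items):
            if resp == 1 and rng.random() < p_flip:
                items[i] = 0

# Hypothetical check: 4,000 records, each with 30 of 40 items correct.
records = [[1] * 30 + [0] * 10 for _ in range(4000)]
simulate_mode_effect(records, mode_shift=1.0, seed=7)
mean_drop = 30 - sum(sum(r) for r in records) / len(records)  # close to 1.0
```

Because the flipping is random, rerunning this inside each bootstrap replication (as the study did) lets the sampling variability of the simulated mode effect propagate into the bootstrap standard errors.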
For each condition, we ran matched sample comparability analyses involving 100 bootstrap replications according to the steps outlined above. To summarize results within a condition, the differences in equating conversions between the paper and simulated online forms were evaluated.

Results: No Mode Effects Simulated

Figure 4 presents the differences between the online score conversions resulting from the 100 bootstrap replications and the reported paper test scale score conversions. These differences are graphed as a function of the paper form raw score. The bootstrap standard errors of the linking for the online conversions are not shown in these graphs. For mathematics, the differences ranged from about 3.5 to 4.5 scale score points across the six samples at most score points. For reading, the differences ranged from about 4.5 to 5.5 across the six samples at most score points. For both mathematics and reading, the bootstrap standard errors were higher at the extreme score points, with a pattern similar to the bootstrap standard errors from the spring 2005 mathematics and reading comparability analyses presented in Tables 2 and 3 and Figure 1.

Insert Figure 4 about here

In general, the sensitivity analysis results suggested that the MSCA method is unlikely to indicate either statistically significant or practically significant performance differences between online and paper groups in situations where no true differences exist. Moreover, the differences observed between the online and paper groups based on the matched samples analyses were not related to the overall proficiency differences between the simulated online and paper samples. Sample M2 from the mathematics simulation resulted in the largest differences between the simulated online and paper groups. For this sample, the median raw score difference over 100 replications was about However, this difference favored the online group over the paper group.
Since the online versus paper scale score differences were within two bootstrap standard errors, and there is currently no reason to hypothesize that the online group would be advantaged by mode-of-administration effects for the TAKS program, these results would not have led to any score adjustments for the online group.

Results: Mode Effects Simulated

Figures 5 to 7 present results of the sensitivity analyses when mode effects were simulated. Figure 5 presents results based on a simulated mode effect of 0.25 raw score points, Figure 6 presents results based on a simulated mode effect of 0.50 raw score points, and Figure 7 presents results based on a simulated mode effect of one raw score point.

Insert Figures 5 to 7 about here

The results presented in Figure 5 suggest that the significance criterion of two bootstrap standard errors was, for the most part, too conservative to identify a simulated mode effect of 0.25 of a raw score point. For mathematics, the results varied over the six simulation samples. For samples M1 and M4, scale score differences exceeded two bootstrap standard errors over nearly all score points, suggesting a significant mode effect that would disadvantage online students. For samples M3, M5, and M6, differences indicated that the simulated online test was more difficult, but the differences were within 10 scale score points and two bootstrap standard errors. For sample M2, no mode effects were indicated. In the case of reading, the results over the six simulated samples were more consistent. In all cases, the simulated online test was more difficult. However, scale score differences were less than 10 points across virtually all score points for all simulated reading data sets, which was within the significance criterion of two bootstrap standard errors.

The results presented in Figure 6 indicated that the matched samples comparability analyses consistently detected simulated mode effects of 0.5 raw score points. For all mathematics samples except M2 and for all reading samples, the scale score differences exceeded two bootstrap standard errors of the linkings. The average scale score differences were between 10 and 20 points for most of these samples. As would be expected, the results shown in Figure 7 indicated that the simulated mode effect of 1.0 raw score points was detected in all mathematics and reading samples. Evidence of mode effects for these data sets was unequivocal.
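The two-bootstrap-standard-error criterion used throughout these analyses can be sketched as follows: at each raw-score point, take the standard deviation of the equated scale score across replications as the bootstrap standard error of the linking, and flag the point when the mean online-minus-paper difference exceeds two such standard errors. The replication data below are synthetic and purely illustrative.

```python
import random
import statistics

def flag_mode_effects(replicated_conversions, paper_conversion, k=2.0):
    """For each raw-score point, return (mean difference, bootstrap SE,
    flagged) where the difference is the mean online-minus-paper scale score
    across replications, the SE is the SD across replications, and a point is
    flagged when |difference| > k * SE."""
    results = []
    for j, paper_ss in enumerate(paper_conversion):
        values = [conv[j] for conv in replicated_conversions]
        diff = statistics.fmean(values) - paper_ss
        se = statistics.stdev(values)  # bootstrap SE of the linking
        results.append((diff, se, abs(diff) > k * se))
    return results

# Hypothetical data: 100 replications over 3 raw-score points, with a large
# online-paper difference built in at the middle point only.
rng = random.Random(1)
paper = [500.0, 520.0, 540.0]
reps = [[500.0 + rng.gauss(0.0, 2.0),
         520.0 + rng.gauss(15.0, 2.0),
         540.0 + rng.gauss(0.5, 2.0)] for _ in range(100)]
flagged = [f for (_, _, f) in flag_mode_effects(reps, paper)]
```

Only the middle score point, where the built-in difference is far larger than the replication noise, is flagged; the small 0.5-point difference at the last point stays within two standard errors, which is exactly how a 0.25-raw-score-point mode effect can go undetected.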
Discussion of the MSCA Sensitivity Analyses

In general, the sensitivity analyses supported the MSCA approach. The method does not seem to be affected by differences in the ability of the group taking an online test versus the comparison paper test takers, at least within the range of differences studied here. One reason for this robustness might be the difference in sample sizes between the online and paper groups. In the grade 8 mathematics and reading comparability studies, the paper groups were larger than the online groups by factors of 125 and 85, respectively. It is not clear from these analyses whether the MSCA approach will work as well if the relative sample sizes of the online and paper groups become more similar. However, this seems unlikely to happen in Texas, at least in the near future.

Not surprisingly, the method did not appear to be robust in detecting a simulated mode effect of 0.25 raw score points. In part, this is a function of how conservative or liberal one is in evaluating the results. The criterion of two bootstrap standard errors of the linkings seemed, in the context of the data studied here, a somewhat conservative criterion. From looking at Figure 5, one could argue that a more liberal criterion (in the sense of being more willing to apply a separate set of score conversions for the paper group) might have led to a decision to adjust the online scores for three of the six samples for both reading and mathematics. Of course, as with any statistical analysis, the power was related to sample size. To the extent that future online comparability analyses in Texas involve increasing online sample sizes, the two bootstrap standard errors criterion will be less likely to be considered conservative. At some point, other considerations may carry more weight in evaluations, such as the magnitude of raw score-to-raw score equating differences. One finding from the sensitivity analyses that would be worth further study was the range of differences across the six simulated online data sets, particularly in the sensitivity analyses done for mathematics.
A limitation of the study was that the same six samples were used to study both the conditions where no mode effects were simulated and the conditions where various levels of mode effects were simulated (since the mode effects data sets were created by randomly changing item responses in the no mode effects data). One of the mathematics simulation samples (sample M2) was drawn by chance in such a way that the matched samples comparability analyses suggested higher performance for the online group when no mode effects were present. Sample M2 was drawn with a targeted distribution of previous scores of lower overall performance than the paper group; however, this does not seem to explain the anomalous results for sample M2. Rather, it seems that some significant sampling variation occurred in the selection of sample M2 that was related to the relationship between the previous test scores (e.g., the spring 2004 grade 7 mathematics and reading scale scores) and the criterion score (spring 2005 grade 8 mathematics raw scores). Thus, a more extensive set of simulations that incorporated the variation in sampling simulated online test takers would be helpful in assessing the extent to which this might be a concern. It might also help to inform decision rules regarding significant mode effects.

One final comment about the sensitivity analyses carried out in this study is that future online comparability studies in Texas will involve matching on a different set of criteria than those used for the spring 2005 grade 8 study. For example, an attractive alternative matching approach would be to create target frequencies based not only on previous scale scores but also on other important demographic variables such as gender, ethnicity, and English language proficiency. Because of the extremely large paper group sample sizes, a fairly refined sampling grid could be defined that incorporates most or all of these variables, although it would be necessary to group previous scale scores into intervals to prevent empty cells in the sampling grid. A limitation of the current study is that it did not examine the sensitivity of the MSCA approach to ways of matching performance between the online and paper groups other than using previously obtained scale scores. We are currently undertaking such sensitivity analyses and will use the results to inform the design of spring 2006 comparability analyses for the TAKS tests.

Conclusions

In K-12 testing, the current advantages and future promises of online testing have reached a tipping point that is encouraging virtually every state to consider or pursue online testing initiatives as part of its testing program. It is easy to envision that K-12 assessments will be administered almost exclusively online within the foreseeable future.
In the enthusiasm to embrace what Bennett (2002) refers to as the inexorable and inevitable evolution of technology and assessment, it is tempting to downplay or dismiss the comparability of online and paper assessments. Nevertheless, state testing programs and the vendors that serve them are clearly obliged to address the issue of score comparability between online and paper versions of K-12 assessments, especially given the high stakes that the results of these assessments have taken on in recent years.
The strategy Texas has adopted for introducing online testing is similar to the strategy many states are using, in which online testing is made available to those districts and schools that are willing and able to pursue it. The comparability studies presented in this paper illustrate how responsible and psychometrically defensible comparability analyses can be incorporated within the constraints of a high-stakes, operational testing program. In Texas, the MSCA approach is a central part of the strategy to offer online and paper versions of TAKS tests side-by-side as the districts and schools in the state transition to online testing. By routinely including these analyses, both when online versions of tests are introduced and as they continue to be offered, it will be possible to monitor the comparability of online and paper tests over time. Although this approach will not be without challenges, it seems an equitable and viable approach to a difficult assessment problem.
References

American Psychological Association Committee on Professional Standards and Committee on Psychological Tests and Assessments (APA). (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author.

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: AERA.

Bennett, R. E. (2002). Inexorable and inevitable: The continuing story of technology and assessment. Journal of Technology, Learning, and Assessment, 1(1).

Bergstrom, B. (1992, April). Ability measure equivalence of computer adaptive and pencil and paper tests: A research synthesis. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Bridgeman, B., Lennon, M. L., & Jackenthal, A. (2001). Effects of screen size, screen resolution, and display rate on computer-based test performance (ETS RR-01-23). Princeton, NJ: Educational Testing Service.

Choi, S. W., & Tinkler, T. (2002, April). Evaluating comparability of paper and computer-based assessment in a K-12 setting. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Dorans, N. J., & Lawrence, I. M. (1990). Checking the statistical equivalence of nearly identical test forms. Applied Measurement in Education, 3,

Glasnapp, D. R., Poggio, J., Poggio, A., & Yang, X. (2005, April). Student attitudes and perceptions regarding computerized testing and the relationship to performance in large scale assessment programs. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
More informationChapter 5 English Language Learners (ELLs) and the State of Texas Assessments of Academic Readiness (STAAR) Program
Chapter 5 English Language Learners (ELLs) and the State of Texas Assessments of Academic Readiness (STAAR) Program Demographic projections indicate that the nation s English language learner (ELL) student
More informationCollege Readiness in the US Service Area 2010 Baseline
College Readiness in the United Way Service Area 2010 Baseline Report The Institute for Urban Policy Research At The University of Texas at Dallas College Readiness in the United Way Service Area 2010
More informationAnalyzing and interpreting data Evaluation resources from Wilder Research
Wilder Research Analyzing and interpreting data Evaluation resources from Wilder Research Once data are collected, the next step is to analyze the data. A plan for analyzing your data should be developed
More informationIntroduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.
Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative
More informationMeasuring the Effectiveness of Rosetta Stone
FINAL REPORT Measuring the Effectiveness of Rosetta Stone Roumen Vesselinov, Ph.D. Visiting Assistant Professor Queens College City University of New York Roumen.Vesselinov@qc.cuny.edu (718) 997-5444 January
More informationThe Effect of Test Preparation on Student Performance
The Effect of Admissions Test Preparation: Evidence from NELS:88 Introduction For students planning to apply to a four year college, scores on standardized admissions tests--the SAT I or ACT--take on a
More informationUtilization of Response Time in Data Forensics of K-12 Computer-Based Assessment XIN LUCY LIU. Data Recognition Corporation (DRC) VINCENT PRIMOLI
Response Time in K 12 Data Forensics 1 Utilization of Response Time in Data Forensics of K-12 Computer-Based Assessment XIN LUCY LIU Data Recognition Corporation (DRC) VINCENT PRIMOLI Data Recognition
More informationGeorgia s New Tests. Language arts assessments will demonstrate: Math assessments will demonstrate: Types of assessments
Parents Guide to New Tests in Georgia In 2010, Georgia adopted the Common Core State Standards (CCSS) in English language arts and mathematics and incorporated them into the existing Georgia Performance
More informationFinal Exam Performance. 50 OLI Accel Trad Control Trad All. Figure 1. Final exam performance of accelerated OLI-Statistics compared to traditional
IN SEARCH OF THE PERFECT BLEND BETWEEN AN INSTRUCTOR AND AN ONLINE COURSE FOR TEACHING INTRODUCTORY STATISTICS Marsha Lovett, Oded Meyer and Candace Thille Carnegie Mellon University, United States of
More informationData Analysis, Statistics, and Probability
Chapter 6 Data Analysis, Statistics, and Probability Content Strand Description Questions in this content strand assessed students skills in collecting, organizing, reading, representing, and interpreting
More information2011-12 Early Mathematics Placement Tool Program Evaluation
2011-12 Early Mathematics Placement Tool Program Evaluation Mark J. Schroeder James A. Wollack Eric Tomlinson Sonya K. Sedivy UW Center for Placement Testing 1 BACKGROUND Beginning with the 2008-2009 academic
More informationA STUDY OF WHETHER HAVING A PROFESSIONAL STAFF WITH ADVANCED DEGREES INCREASES STUDENT ACHIEVEMENT MEGAN M. MOSSER. Submitted to
Advanced Degrees and Student Achievement-1 Running Head: Advanced Degrees and Student Achievement A STUDY OF WHETHER HAVING A PROFESSIONAL STAFF WITH ADVANCED DEGREES INCREASES STUDENT ACHIEVEMENT By MEGAN
More informationCRITICAL THINKING ASSESSMENT
CRITICAL THINKING ASSESSMENT REPORT Prepared by Byron Javier Assistant Dean of Research and Planning 1 P a g e Critical Thinking Assessment at MXC As part of its assessment plan, the Assessment Committee
More informationAPEX program evaluation study
APEX program evaluation study Are online courses less rigorous than in the regular classroom? Chung Pham Senior Research Fellow Aaron Diel Research Analyst Department of Accountability Research and Evaluation,
More informationWoodcock Reading Mastery Tests Revised, Normative Update (WRMT-Rnu) The normative update of the Woodcock Reading Mastery Tests Revised (Woodcock,
Woodcock Reading Mastery Tests Revised, Normative Update (WRMT-Rnu) The normative update of the Woodcock Reading Mastery Tests Revised (Woodcock, 1998) is a battery of six individually administered tests
More informationStatistics. Measurement. Scales of Measurement 7/18/2012
Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does
More informationData Coding and Entry Lessons Learned
Chapter 7 Data Coding and Entry Lessons Learned Pércsich Richárd Introduction In this chapter we give an overview of the process of coding and entry of the 1999 pilot test data for the English examination
More informationSTATISTICS FOR PSYCHOLOGISTS
STATISTICS FOR PSYCHOLOGISTS SECTION: STATISTICAL METHODS CHAPTER: REPORTING STATISTICS Abstract: This chapter describes basic rules for presenting statistical results in APA style. All rules come from
More informationNebraska School Counseling State Evaluation
Nebraska School Counseling State Evaluation John Carey and Karen Harrington Center for School Counseling Outcome Research Spring 2010 RESEARCH S c h o o l o f E d u c a t i o n U n i v e r s i t y o f
More informationCourse Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics
Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This
More informationInformation and Employee Evaluation: Evidence from a Randomized Intervention in Public Schools. Jonah E. Rockoff 1 Columbia Business School
Preliminary Draft, Please do not cite or circulate without authors permission Information and Employee Evaluation: Evidence from a Randomized Intervention in Public Schools Jonah E. Rockoff 1 Columbia
More informationAutomated Scoring for the Assessment of Common Core Standards
Automated Scoring for the Assessment of Common Core Standards David M. Williamson, Senior Research Director, Applied Research and Development, ETS Randy E. Bennett, Frederiksen Chair, Assessment Innovation,
More informationOnline Assessment and the Comparability of Score Meaning
Research Memorandum Online Assessment and the Comparability of Score Meaning Randy Elliott Bennett Research & Development November 2003 RM-03-05 Online Assessment and the Comparability of Score Meaning
More informationRecommendations for Implementation of 14-point Test Security Plan
Texas Education Agency Recommendations for Implementation of 14-point Test Security Plan Recommendation #1 TEA will analyze scrambled blocks of test questions to detect answer copying. Given the blueprint
More informationEvaluating the Comparability of Scores from Achievement Test Variations
Evaluating the Comparability of Scores from Achievement Test Variations Phoebe C. Winter, Editor Copyright 2010 by the Council of Chief State School Officers, Washington, DC All rights reserved. THE COUNCIL
More informationTEA UPDATE ON STAAR MATHEMATICS. Texas Education Agency Student Assessment Division Julie Guthrie July 2014
TEA UPDATE ON STAAR MATHEMATICS Texas Education Agency Student Assessment Division Julie Guthrie July 2014 New STAAR Mathematics Information Implementation of New STAAR Mathematics General STAAR Updates
More informationDevice Comparability of Tablets and Computers for Assessment Purposes
Device Comparability of Tablets and Computers for Assessment Purposes National Council on Measurement in Education Chicago, IL Laurie Laughlin Davis, Ph.D. Xiaojing Kong, Ph.D. Yuanyuan McBride, Ph.D.
More informationA Comparison Between Online and Face-to-Face Instruction of an Applied Curriculum in Behavioral Intervention in Autism (BIA)
A Comparison Between Online and Face-to-Face Instruction of an Applied Curriculum in Behavioral Intervention in Autism (BIA) Michelle D. Weissman ROCKMAN ET AL San Francisco, CA michellew@rockman.com Beth
More informationResults: Statewide Stakeholder Consultation on Draft Early Childhood Standards and Indicators
Results: Statewide Stakeholder Consultation on Draft Early Childhood Standards and Indicators Prepared for Minnesota Department of Education and the Minnesota Department of Human Services February 2011
More informationOBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS
OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS CLARKE, Stephen R. Swinburne University of Technology Australia One way of examining forecasting methods via assignments
More informationUsing Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data
Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable
More informationAssessments in Arizona
Parents Guide to new Assessments in Arizona In June 2010, Arizona adopted the Common Core State Standards (CCSS) which were customized to meet the needs of our state and released as the Arizona Common
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationCOMMON CORE STATE STANDARDS FOR
COMMON CORE STATE STANDARDS FOR Mathematics (CCSSM) High School Statistics and Probability Mathematics High School Statistics and Probability Decisions or predictions are often based on data numbers in
More informationTest Scoring And Course Evaluation Service
Test Scoring And Course Evaluation Service TABLE OF CONTENTS Introduction... 3 Section 1: Preparing a Test or Questionnaire... 4 Obtaining the Answer Forms... 4 Planning the Test or Course evaluation...
More informationNational assessment of foreign languages in Sweden
National assessment of foreign languages in Sweden Gudrun Erickson University of Gothenburg, Sweden Gudrun.Erickson@ped.gu.se This text was originally published in 2004. Revisions and additions were made
More informationRARITAN VALLEY COMMUNITY COLLEGE ACADEMIC COURSE OUTLINE MATH 111H STATISTICS II HONORS
RARITAN VALLEY COMMUNITY COLLEGE ACADEMIC COURSE OUTLINE MATH 111H STATISTICS II HONORS I. Basic Course Information A. Course Number and Title: MATH 111H Statistics II Honors B. New or Modified Course:
More informationUnderstanding District-Determined Measures
Understanding District-Determined Measures 2013-2014 2 Table of Contents Introduction and Purpose... 5 Implementation Timeline:... 7 Identifying and Selecting District-Determined Measures... 9 Key Criteria...
More information1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2
PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand
More informationArizona and New York Schools Push the Envelope
Arizona and New York Schools Push the Envelope Sean Brady (Prism Decision Systems, LLC) and Andrew Tait (Idea Sciences) In the United States, accountability measures from the No Child Left Behind (NCLB)
More informationSPECIFIC LEARNING DISABILITY
SPECIFIC LEARNING DISABILITY 24:05:24.01:18. Specific learning disability defined. Specific learning disability is a disorder in one or more of the basic psychological processes involved in understanding
More information