Score Comparability of Online and Paper Administrations of the Texas Assessment of Knowledge and Skills

Walter D. Way
Laurie Laughlin Davis
Steven Fitzpatrick
Pearson Educational Measurement

Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA, April 2006
Introduction

A rapidly increasing number of state education departments are exploring or implementing online assessments as part of their statewide assessment programs. The potential advantages of online testing in K-12 settings are obvious. These include quicker turnaround of results, cost savings related to printing and shipping paper test materials, improved test security, more flexible and less burdensome test administrations, and a technological basis for introducing innovative item formats and test delivery algorithms. In addition, recent surveys indicate that students testing online enjoy their experiences, feel comfortable with taking tests by computer, and tend to prefer it to traditional paper testing (Glasnapp, Poggio, Poggio, & Yang, 2005; O'Malley et al., 2005; Ito & Sykes, 2004). In states where online testing has been introduced as part of their high-stakes assessments, not all schools have had the infrastructure and equipment to test online. For this reason, paper and online versions of the same tests are typically offered side-by-side. Any time paper-based and online assessments co-exist, professional testing standards indicate the need to ensure comparable results across the two mediums. The Guidelines for Computer-Based Tests and Interpretations (APA, 1986) states: "...when interpreting scores from the computerized versions of conventional tests, the equivalence of scores from computerized versions should be established and documented before using norms or cut scores obtained from conventional tests" (p. 18). The joint Standards for Educational and Psychological Testing also recommends empirical validation of score interpretations across computer-based and paper-based tests (AERA, APA, & NCME, 1999, Standard 4.10). The comparability of test scores based on online versus paper testing has been studied for more than 20 years.
Reviews of the comparability literature were reported by Mazzeo and Harvey (1988), who reported mixed results, and Drasgow (1993), who concluded that there were essentially no differences in examinee scores by mode of administration for power tests. Paek (2005) provided a summary of more recent comparability research and concluded that, in general, computer and paper versions of traditional multiple-choice tests are comparable across grades and academic subjects. However, when tests are timed, differential speededness can lead to mode effects. For example, a recent study by Ito and Sykes (2004) reported significantly lower performance on timed web-based norm-referenced tests at grades 4-12 compared with paper versions. These differences seemed to occur because students needed more time on the web-based test than they did on the paper test. Pommerich (2004) reported evidence of mode differences due to differential speededness in tests given at grades 11 and 12, but in her study online performance on questions near the end of several tests was higher than paper performance on the same items. She hypothesized that students who are rushed for time might actually benefit from testing online because the computer makes it easier to respond and move quickly from item to item. A number of studies have suggested that no mode differences can be expected when individual test items can be presented within a single screen (Poggio, Glasnapp, Yang, & Poggio, 2005; Hetter, Segall, & Bloxom, 1997; Bergstrom, 1992; Spray, Ackerman, Reckase, & Carlson, 1989). However, when items are associated with text that requires scrolling, as is typically the case with reading tests, studies have indicated lower performance for students testing online (O'Malley et al., 2005; Pommerich, 2004; Bridgeman, Lennon, & Jackenthal, 2003; Choi & Tinkler, 2002; Bergstrom, 1992). In general, the results of comparability research are difficult to evaluate for several reasons. First, there has been a continual evolution in both computer technology and the computer skills of test-takers. Thus, earlier studies have limited generalizability, and more recent studies may not generalize well to future settings. Second, most comparability research is carried out in the context of operational testing programs, where less-than-desirable experimental control is usually the norm.
In such studies, conclusions are often tempered because of design limitations such as lack of random assignment, insufficient statistical power, order-of-administration effects, and effects due to differences in test forms given across modes. Finally, the content areas, test designs, test administration systems, and testing populations can differ considerably across comparability studies, and differences in any of these factors could lead to different findings from one study to another. For a policy maker interested in introducing online assessments for a high-stakes K-12 testing program, the need to assess comparability creates a number of challenges. While some stakeholders will lobby for immediate and widespread introduction of online testing, researchers and psychometricians will advise more cautious and controlled experimental studies. Such studies can be expensive and usually require efforts beyond those needed to meet the usual challenges associated with the ongoing paper-based program. Furthermore, no matter how well a comparability study is designed, executing the design depends on the volunteer participation of individual schools and districts. As such, one can expect that schools will vary in their ability to execute the procedures called for in the experimental design, and that a nontrivial number of schools signed up for the study will invariably drop out. Poggio et al. (2005) and Poggio, Glasnapp, Yang, Beauchamp, and Dunham (2005) reported on an approach to comparability research in the live context of the Kansas assessment program that balanced an aggressive approach to online implementation with the need to collect comparability data. In their studies, all schools were invited to administer the Kansas Computerized Assessment (KCA), and online volunteers were further asked if they would be willing to double-test their students by administering a paper form of the test in addition to the online assessment. Studies were carried out for grade 7 mathematics in spring 2003 and, in a later administration, for mathematics (grades 4, 7, and 10) and reading (grades 5, 8, and 11). The studies reported no evidence of mode effects for any of the tests evaluated. However, some of the findings may have been confounded by order-of-administration effects and limited samples of students for whom testing order could be reliably identified. If a mode effect for reading did exist, it is not clear whether the design carried out could have identified it, and if so, whether a sufficient statistical adjustment could have been applied. Because only a subset of students taking the KCA also took the paper test, it would not have been possible to assign each online student the higher of two scores.
In this paper, we present results from two online comparability studies that were conducted for the Texas statewide assessment program in spring 2005. The purpose of the studies was to evaluate the comparability of online and paper versions of the Texas Assessment of Knowledge and Skills (TAKS) in mathematics, reading/English language arts, science, and social studies at grades 8 and 11 for the purposes of test score reporting, and to appropriately adjust equated score conversion tables for students testing online as warranted. In the sections that follow, we will describe the TAKS program and initial efforts to transition the program to online testing, introduce the design and methodology used for the comparability studies at each grade level, and present results of the score comparability studies conducted at grades 8 and 11.
In particular, we will introduce an approach and design for studying the comparability of online and paper tests that we refer to as matched samples comparability analyses (MSCA). We believe this approach is particularly well-suited to monitoring comparability as states transition their high-stakes testing programs to online testing. In the last section of this paper, we will report on some additional analyses that evaluate the sensitivity of the MSCA approach for detecting differences in online and paper group performance when these groups differ in terms of overall proficiency.

The TAKS Program and Online Testing

TAKS is the primary state-mandated assessment in Texas, and represents the latest and most comprehensive implementation of statewide assessments in Texas, which have been ongoing for more than 20 years. First administered in spring 2003, TAKS is given to students in mathematics at grades 3-10 and at the exit level (grade 11); in reading at grades 3-9; in writing at grades 4 and 7; in English language arts (ELA) at grade 10 and at the exit level; in science at grades 5, 8, and 10 and at the exit level; and in social studies at grades 8 and 10 and at the exit level. Spanish versions of TAKS are available at grades 3-6. Every TAKS test is directly aligned to the Texas Essential Knowledge and Skills (TEKS) curriculum. On each TAKS test, the critical knowledge and skills are measured by a series of test objectives. These objectives are not found verbatim in the TEKS curriculum. Rather, the objectives are umbrella statements that serve as headings under which student expectations from the TEKS can be meaningfully grouped. TAKS test results are used to comply with the requirements of the No Child Left Behind (NCLB) act, as well as for statewide accountability purposes. The exit level TAKS is part of high school graduation requirements in Texas and is offered multiple times to students who do not pass.
Test results are reported to teachers and parents, and are used for instructional decisions as appropriate. The TAKS tests are scaled separately at each grade, with a score of 2100 representing "met standard" and 2400 representing "commended performance" at each grade level. In practice, the highest equated scale score below each of these thresholds is set to the threshold value. Additional information on the TAKS can be found on the Texas Education Agency (TEA) web site.

The TEA first began testing by computer in fall 2002, when an end-of-course examination in Algebra I was made available online and districts were given the option of using this test in either online or paper format. In spring 2004, an online testing pilot was carried out in three grade 8 TAKS subject areas: reading, mathematics, and social studies. The goals of the pilot were to determine the administrative procedures necessary to deliver online assessments in the schools, to assess the readiness of Texas school districts to administer online assessments, to document administrative challenges, and, to the extent possible, to compare performance on online assessments with paper test performance. The pilot tests were administered in volunteering campuses during a two-week window prior to the operational grade 8 TAKS administration. Although data related to online performance were collected, the design of the pilot did not permit conclusive comparisons of online and paper performance. In spring 2005, the TEA carried out additional studies of online testing at grades 8 and 11 to compare online and paper test performance in reading, mathematics, social studies, and science. Score comparability for science was assessed only at grade 11, although a science field test at grade 8 included an online component. The grade 8 and 11 studies involved different data collection designs. At grade 8, schools that volunteered to participate were randomly assigned to administer one of the three TAKS content areas online. The same test form was administered both on paper and online. Each student tested only one time in a given content area; thus, the results for students testing online were to be reported as part of the statewide assessment results. At grade 11 (exit level), a special re-test administration was offered in June. Students in the participating schools who had not yet passed exit-level TAKS in at least one of the four subject areas were offered an extra testing opportunity as part of this administration.
In addition, a small number of students who would be entering grade 11 in the fall were allowed to participate in the administration (these students will be referred to as "rising juniors"). For each exit-level TAKS subject area, volunteering students in these schools were randomly assigned to take either an online or a paper version of the same test form.

Research Methodology

The comparability study design required conducting analyses that would support score adjustments for those students testing online, if such adjustments were warranted. To accomplish this, we utilized an approach that considered score comparability in the context of test equating. Specifically, we equated the online version of the tests to the paper version of the tests under the assumptions of a random groups design. The details of how the equatings were accomplished differed for grade 8 and grade 11, as described below.

Matched Samples Comparability Analyses for Grade 8

For grade 8, we initially thought that the comparability data could be analyzed based on random assignment to condition at the school level, as it was expected that approximately 40 schools would administer each of the three content areas online. However, voluntary participation in the comparability study was much lower than expected, and the number of schools testing in each subject area was too small to support analyses based on random assignment at the school level. As a result, we compared test performance for students testing online with comparison groups, drawn from the paper results, that were matched to the online students in terms of spring 2004 test performance. We refer to this approach as matched samples comparability analyses (MSCA). In this approach, student scale scores for reading and mathematics obtained in grade 7 were used as matching variables, and sub-samples of students equal in number to the students testing online were selected from the paper TAKS tests. The paper students were selected so that the distributions of grade 7 reading and mathematics scores in the online and matched paper groups were identical. In devising this approach, we first regressed 2004 grade 8 TAKS scale scores on 2003 grade 7 TAKS scale scores.
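The exact-matching step just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the record layout (dictionaries with hypothetical g7_reading and g7_math keys) is an assumption.

```python
import random
from collections import defaultdict

def draw_matched_paper_sample(online_students, paper_students, seed=0):
    """For each online student, draw a paper student whose grade 7
    reading and mathematics scale scores are identical to the online
    student's (the MSCA matching step). Field names are illustrative."""
    rng = random.Random(seed)
    # Index the paper students by their (reading, math) prior-score pair.
    strata = defaultdict(list)
    for s in paper_students:
        strata[(s["g7_reading"], s["g7_math"])].append(s)
    # Pick one paper student from the matching stratum per online student.
    return [rng.choice(strata[(s["g7_reading"], s["g7_math"])])
            for s in online_students]
```

Sampling within each stratum with replacement, as here, is also what makes the stratified bootstrap in the replication step straightforward.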
We found the following multiple correlations across reading, mathematics, and social studies (note that there is no grade 7 social studies test):

  Dependent Variable   Independent Variable(s)    r
  G8ReadingSS          G7ReadingSS                0.74
  G8ReadingSS          G7ReadingSS, G7MathSS      0.76
  G8MathSS             G7MathSS                   0.82
  G8MathSS             G7ReadingSS, G7MathSS      0.83
  G8SocSS              G7ReadingSS, G7MathSS      0.72

The MSCA involved a bootstrap method that was designed to establish raw-to-scale score conversions by equating the online form to the paper form, and also to estimate bootstrap standard errors of the equating to assist in interpreting differences between the online and paper score conversions (cf. Kolen & Brennan, 2004). The application of equating methods was based on the assumption that the online and matched paper sample groups were randomly equivalent. For each replication, we used IRT true score equating based on Rasch calibrations of the online and paper samples using the WINSTEPS program (Linacre, 2001). The MSCA involved sampling with replacement, in which both online and matched paper student samples were drawn 500 times and analyses were repeated for each replicated sample. The specific procedures used in the MSCA were as follows:

1. Each student testing online with grade 7 TAKS scores in reading and mathematics was matched to a student from the available 2005 paper TAKS data with identical grade 7 reading and mathematics scale scores. Both reading and mathematics were used in the matching for all three grade 8 subject areas.

2. Online versus paper comparability analyses were performed using the matched groups of students by repeating the following steps 500 times:

   a. A bootstrap sample of students (i.e., random sampling with replacement) was drawn from the online participants.

   b. A matched stratified bootstrap sample (i.e., random sampling with replacement at each combination of mathematics and reading scores observed in the online sample drawn in step 2.a) was drawn from the available 2005 paper TAKS data.

   c. A raw score-to-raw score equating was carried out with each pair of bootstrap samples as follows:

      i. WINSTEPS was used to calibrate the online group data, centering the calibrations so that the mean of the ability estimates was zero. The item difficulty estimates and raw score-to-theta conversions were retained.

      ii. WINSTEPS was used to calibrate the paper comparison group data, centering the calibrations so that the mean of the ability estimates was zero. The item difficulty estimates and raw score-to-theta conversions were retained.

      iii. IRT true score equating was used to find the paper comparison group raw score equivalent for each integer raw score in the online group by calculating ΣP(θ), where the summation is over the paper item difficulty estimates and θ is taken from the raw score-to-theta conversions for the integer online raw score found in step 2.c.i.

   d. Using linear interpolation and the unrounded operational 2005 raw score-to-scale score conversion tables, the paper raw score equivalents found in step 2.c.iii were converted to scale score equivalents.

3. The online scale score conversion for each raw score was based on the average of the conversions calculated over the 500 replications. These average scale score values comprised the alternate online raw score-to-scale score conversion table.

4. The standard deviation of the online scale score conversions at each raw score represented the conditional bootstrap standard error of the linking.

To assist in comparing the online and paper score conversions, we considered the following criterion suggested by Dorans and Lawrence (1990): "To assess equivalence, it is convenient to compute the difference between the equating function and the identity transformation, and to divide this difference by the standard error of equating. If the resultant ratio falls within a bandwidth of plus or minus two, then the equating function is deemed to be within sampling error of the identity function" (p. 247). It should be pointed out that the Dorans and Lawrence criterion is only one of many justifiable approaches that could be used to interpret the results. We also paid special attention to differences in the range of scale scores around the "met standard" score levels. Differences at the extremes of the scale were considered less important, given the purpose and primary uses of the TAKS tests.

Grade 11 Comparability Analyses

For the grade 11 comparability analyses, the researchers involved in the study randomly assigned the participating students from each school to the online or paper testing conditions. Because testing occurred over a single day for each subject area and many of the participating schools were limited in how many students they could test in a single day, slightly more students were assigned to the paper condition than to the online condition. To evaluate score comparability for the grade 11 study, we employed some of the same procedures used in the MSCA analyses for grade 8. Specifically, we randomly selected students from the online and paper samples with replacement 500 times and equated the scores obtained in each sampling replication. These bootstrap analyses resulted in alternate online score conversion tables for each test and bootstrap standard errors of equating to assist in interpreting results. One difference between the grade 11 and the grade 8 analyses was that the bootstrap replications involved simple random sampling with replacement; that is, there was no need to select a sample from the paper group that was matched to the online sample in terms of previous test scores.
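The core equating step (2.c.iii) and the bootstrap summary (steps 3 and 4) can be sketched as below. This is a simplified illustration under the Rasch model, not the WINSTEPS-based production code; the theta table and item difficulties are assumed inputs produced by the calibration steps.

```python
import math
import statistics

def rasch_true_score(theta, difficulties):
    """Expected raw score at theta: the sum over items of P_j(theta)
    under the Rasch model, P_j = 1 / (1 + exp(-(theta - b_j)))."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def true_score_equate(online_theta_table, paper_difficulties):
    """Map each integer online raw score to a paper raw-score equivalent
    by evaluating the paper test characteristic curve at the theta
    attained for that online raw score (step 2.c.iii)."""
    return {raw: rasch_true_score(theta, paper_difficulties)
            for raw, theta in online_theta_table.items()}

def summarize_replications(replicated_conversions):
    """Steps 3-4: average the equated value at each raw score over the
    bootstrap replications, and take its standard deviation as the
    conditional bootstrap standard error of the linking."""
    raw_scores = replicated_conversions[0].keys()
    mean = {r: statistics.mean(rep[r] for rep in replicated_conversions)
            for r in raw_scores}
    se = {r: statistics.stdev(rep[r] for rep in replicated_conversions)
          for r in raw_scores}
    return mean, se
```

The Dorans and Lawrence criterion then amounts to flagging raw scores where |equated value minus identity| exceeds twice the bootstrap standard error from `summarize_replications`.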
Another difference was that the bootstrap analyses for grade 11 ELA incorporated polytomously scored constructed-response and extended essay item types.

Results

Matched Samples Comparability Analyses for Grade 8

Table 1 presents the means and standard deviations of the grade 8 raw scores and grade 7 scale scores for each test evaluated using the MSCA. It can be seen in Table 1 that the mean raw scores on the grade 8 tests for the online and paper groups are similar (within 0.16) for all three tests. The grade 7 reading and mathematics scale scores used with the MSCA were very similar for the mathematics and social studies online and paper samples (within 7 points). However, for reading, the previous scale scores were noticeably higher for the online group than for the paper group (the mean reading scale score was about 18 points higher and the mean mathematics scale score was about 12 points higher).

Insert Table 1 about here

Tables 2 to 4 summarize the comparability analysis results for mathematics, reading, and social studies. The columns of the tables are as follows:

RS - Paper test raw score.
CBT_RS - Equivalent raw score on the online test based on the MSCA equating. Note that a higher equivalent raw score indicates that the online version of the test was more difficult.
RS_SD - Standard deviation of the equivalent raw scores over the 500 replications.
PAP_SS - Paper test scale score conversion, based on the 2005 TAKS equating results.
CBT_SS - Equivalent scale score on the online test based on the MSCA equating. Again, a higher equivalent scale score indicates that the online version of the test was more difficult.
SS_SD - Standard deviation of the equivalent scale scores over the 500 replications.
RS_DIF - Difference between the online raw score equivalent and the paper raw score.
SS_DIF - Difference between the online scale score equivalent and the paper scale score.
SIG? - Scale score differences exceeding two bootstrap standard errors are noted by **.

Insert Tables 2 to 4 about here

In these tables, the equating conversions for the online and paper forms are assumed to be the same for zero and perfect scores, since true score equating conversions cannot be estimated with the Rasch model at these score points. For mathematics (Table 2), the online versus paper differences were slight. In terms of the raw score conversions, the differences were never as much as one-half of a point.
In terms of scale score conversions, the differences were less than five points over most of the scale. However, at the upper raw score points (41 and higher), scale score differences exceeded two standard errors of the linking. For reading (Table 3), large differences occurred throughout the scale. Differences in raw score conversions exceeded one and a half points over much of the score range. Differences in scale score conversions were over 20 points over most of the score range. All of the differences in scale score conversions exceeded two standard errors of the linking. For social studies (Table 4), slight differences in both raw score and scale score conversions occurred. The raw score differences were never as much as one-half of a point, and the scale score differences were never as much as six points. None of the scale score differences exceeded two standard errors of the linking. Figure 1 presents the differences between the online and paper scale score conversions graphically as a function of raw score, along with upper and lower intervals defined by plus and minus two bootstrap standard errors of equating. These graphs provide a relatively concise summary of the patterns found in Tables 2 through 4.

Insert Figure 1 about here

Grade 11 Comparability Results

Table 5 presents univariate statistics for the online and paper groups participating in the grade 11 comparability study. These data indicate higher raw scores for the paper group than for the online group in mathematics, science, and ELA. For social studies, the raw score mean is slightly higher for the online group than for the paper group.

Insert Table 5 about here

Unlike the other exit level TAKS tests, which consist entirely of objectively scored items (i.e., multiple-choice and, for mathematics, a small number of grid-in response items), ELA contains both short-answer open-ended items and an extended essay item. Table 6 presents univariate statistics for the online and paper groups on the ELA test broken down by item type. The first entry in the table (ELA) is the unweighted sum of the items and has a possible maximum of 61 points (48 points for multiple-choice, 9 points for short answer, and 4 points for the essay). The second entry is weighted (ELA_WT), with the essay counting four times its maximum point total. The maximum possible weighted score is 73.
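The weighting arithmetic can be made concrete with a small helper; this is our reconstruction of the scoring arithmetic described above, not official TAKS scoring code.

```python
def ela_weighted_raw(mc, short_answer, essay):
    """Weighted ELA raw score: multiple-choice (max 48) and short-answer
    (max 9) points count at face value, while the essay (rubric max 4)
    counts four times its score, so the weighted maximum is
    48 + 9 + 4 * 4 = 73 (versus an unweighted maximum of 61)."""
    assert 0 <= mc <= 48 and 0 <= short_answer <= 9 and 0 <= essay <= 4
    return mc + short_answer + 4 * essay
```

For example, a student with 30 multiple-choice points, 5 short-answer points, and an essay score of 3 receives a weighted raw score of 30 + 5 + 12 = 47.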
The scale score for the ELA test is based on the weighted raw score. For the multiple-choice raw score, the essay score, and two of the three open-ended scores, the mean for the paper group was slightly higher than the mean for the online group.

Insert Table 6 about here

For many of the students participating in the grade 11 study, a previous TAKS score was available. These scores are listed in Table 7, which includes entries both for students whose previously available TAKS score was from the grade 10 test (mostly the "rising juniors") and for students whose previously available TAKS score was from the exit level test.

Insert Table 7 about here

Note that previous TAKS score means were substantially higher for students whose previous score was from the grade 10 test than for students whose previously available scale score was from the exit level test.(1) This was expected, given that the rising juniors participating in the study were generally higher achieving students, as compared with grade 11 students who had previously failed the exit level TAKS. For all four tests, the previous TAKS score means were very similar (within three scale score points) for online and paper students with a previous score from the exit level test. For mathematics, science, and social studies, students testing online had higher previous TAKS score means from grade 10 than students testing by paper. Although these scale score differences were comparatively large (16.2, 18.6, and 12.2, respectively), they seem less of a concern considering the small numbers of students and the much larger scale score standard deviations for the students with a previous TAKS score from the grade 10 test. Considering all of the information from the analysis of previous TAKS scores, we concluded that the assumption of randomly equivalent online and paper groups was reasonable for the grade 11 TAKS comparability analyses.
Figures 2 and 3 present graphs of the differences between the grade 11 online and paper scale scores as a function of raw score, along with upper and lower intervals defined by ±2 bootstrap standard errors of equating. Figure 2 presents results for mathematics and ELA, and Figure 3 presents results for science and social studies.

(1) Strictly speaking, the mean scale scores on the grade 10 and 11 tests are not directly comparable, since the scales are uniquely defined within each grade. However, since "met standard" and "commended performance" are defined at 2100 and 2400 for both grades, general comparative inferences seem reasonable.

Insert Figures 2 and 3 about here

The results for grade 11 mathematics presented in Figure 2 are very similar to the results for grade 8, except that both the online minus paper scale score differences and the intervals defined by ±2 bootstrap standard errors are larger than they were for grade 8 mathematics. For grade 11 mathematics, the online minus paper raw score differences were as high as 0.77 of a raw score point, and these differences occurred in the region of the "met standard" cut score. Thus, even though the scale score differences for grade 11 mathematics were within ±2 bootstrap standard errors, there was evidence of a greater mode effect for grade 11 mathematics than for grade 8 mathematics. The results for grade 11 ELA presented in Figure 2 also indicated that the test was more difficult for the online group than for the paper group, but as with grade 11 mathematics, the scale score differences were within ±2 bootstrap standard errors of equating over most of the score range. The online minus paper differences were never as large as one weighted raw score point, although the largest difference (0.95) occurred in the region of the "met standard" cut score. The differences at extremely high score levels (i.e., above a weighted raw score of 65) reflect the fact that ELA scale score conversions change by large amounts with changes in the weighted raw score in this region of the scale. For example, in the paper conversion table, a weighted raw score of 70 corresponded to a scale score of 2802, and the scale score for a weighted raw score of 71 was substantially higher still. Thus, an online minus paper raw score difference of less than a point in this region of the scale converted to a scale score difference of about 50 points.
In addition, the Rasch true score equating was not accurate in this region of the scale because of the limited number of high-performing students participating in the study and the relative difficulty of the open-ended constructed-response items.(2)

(2) In fact, to obtain Rasch scaling tables that extended over the entire range of possible scores, it was necessary to augment both the online and paper ELA samples with an imputed item response record that included maximum scores on the two open-ended items for which no student in either sample obtained the maximum possible score (see Table 6).
The results for grade 11 science and social studies presented in Figure 3 indicate little or no evidence of mode effects between the students testing online and the students testing on paper. The results for science indicated that the online version was slightly more difficult than the paper version, but the raw score differences were never as high as one-half of a raw score point. The scale score differences ranged between 3 and 7 points over most of the scale, and these differences never exceeded ±2 bootstrap standard errors. The social studies results indicated that the online form was slightly easier than the paper version. Raw score differences were never more than 0.40 of a raw score point, and scale score differences were six points or lower over most of the scale. The social studies differences never exceeded ±2 bootstrap standard errors, although the bootstrap standard errors of equating for social studies were much larger than for the other tests because only 355 online students and 388 paper students took the social studies test.

Summary of the Grade 8 and Grade 11 Comparability Study Results

To summarize the results of the grade 8 and grade 11 comparability studies, there was evidence across grades and content areas that the online versions of the TAKS tests were more difficult than the paper versions. In grade 8 reading, the mode differences were quite pronounced and warranted the use of the alternate score conversion table for reporting online results. In grade 11 mathematics and ELA, the differences were less pronounced, and the ELA results were also complicated by the contributions of constructed-response and extended essay items to the total scores. Nevertheless, the alternate score conversions were used for reporting scores on these tests, in part because of the magnitudes of the raw score differences but also because of the high stakes associated with these tests.
For the social studies tests, there was little evidence of mode effects across the two grades: differences slightly favored the paper group at grade 8 and slightly favored the online group at grade 11. The comparability results for grade 8 mathematics and grade 11 science also favored the paper groups, although the differences were slight and within ±2 bootstrap standard errors of equating for nearly all score points. In general, the results of the comparability analyses for the TAKS tests at grades 8 and 11 were consistent with the existing literature on the comparability of online and paper assessments, in that the tests where the most significant mode differences were detected involved reading passages that required scrolling. The mode differences in mathematics, although not large, were less consistent with the comparability literature, which mostly supports the comparability of online and paper mathematics tests. Keng, McClarty, and Davis (2006) further investigate the mode differences found for these measures through item-level analyses.

Sensitivity of the MSCA Approach

Although the MSCA method appeared to work well in the context of the grade 8 TAKS online tests, the conditions for the matched sample analyses were favorable in that the ability levels of the paper and online groups (based on previous test scores) were reasonably similar. In reviewing results of the analyses, technical advisors working with the state of Texas recommended that the performance of the MSCA method be studied further to see how sensitive it is under conditions where the online group and the paper group are less similar in overall ability. Such documentation is important given that the MSCA approach has been used to determine and potentially apply alternate score conversions for students taking operational TAKS tests online. To address this recommendation, additional sensitivity analyses of the MSCA method were carried out. The purpose of these analyses was to answer two specific questions:

1. How will the matched sample analyses perform when no mode differences exist but the online group and paper group differ in ability based on past test performance?

2. Will the matched sample analyses recover simulated mode differences when the online and paper groups differ in ability based on past test performance?

Sensitivity Analysis Procedures

The general approach for the sensitivity analyses was to select samples of students from the paper data used in the spring 2005 grade 8 comparability study and to carry out matched sample analyses as if these samples were students testing online. Analyses were conducted for mathematics and reading.
Four sets of analyses were undertaken. The first set utilized six mathematics data sets and six reading data sets drawn from the overall paper data for each measure. For each of the six data sets for a given test, a different target frequency distribution was established for sampling students. The variables used to sample the data were the previous spring's scale scores (mathematics scale scores for mathematics, reading scale scores for reading). The sample sizes were 1,275 for the mathematics data sets and 1,850 for the reading data sets, roughly equivalent to the numbers of spring 2005 grade 8 online testers in these subjects. Tables 8 and 9 list the scale score frequencies and score means of the six selected sensitivity samples and the overall paper group. For both mathematics and reading, performance increased from sample 1 to sample 6, and sample 4 was proportionally equivalent to the overall paper data.

Insert Tables 8 and 9 about here

The second, third, and fourth sets of analyses simulated mode differences between the online and paper groups. The data for these analyses were created by systematically modifying the data sets from the first set of analyses to lower performance on the 2005 test. Three conditions of lowered performance (0.25, 0.5, and 1.0 raw score points, respectively) were simulated for each data set from the first set of samples. To accomplish this, the responses to randomly selected items were changed from correct to incorrect for approximately one-half of the students. (Only one-half of the records were altered to ensure that some perfect and near-perfect scores remained in the data.) Because the process of changing responses from correct to incorrect was random, it was built into the bootstrap replications. The SAS code to accomplish this change is shown below:

compare = 1/rawold * &mode. * 2;
rchange = ranuni(-1);
if rchange > 0.5 then do;
  do i = 1 to &nitem.;
    if items{i} = 1 then do;
      if compare > ranuni(-1) then items{i} = 0;
    end;
  end;
end;

The variable rawold is the student's original raw score, and &mode is equal to 0.25, 0.5, or 1.0, depending upon the condition. The SAS function ranuni generates a uniform random variable between zero and one. The total number of conditions in the sensitivity analyses was 48 (2 content areas × 6 samples × 4 sets of analyses).
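The logic of the SAS step above can be mirrored in Python as follows. This is a sketch rather than the operational code: each record is altered with probability one-half, and within an altered record each correct response is flipped with probability 2 × mode_shift / rawold, so the expected raw score decrement over all records is about mode_shift points. The check at the end uses hypothetical records, not TAKS data.

```python
import random

def simulate_mode_effect(records, mode_shift, seed=None):
    """Lower raw scores by about mode_shift points on average by flipping
    randomly chosen correct responses to incorrect for roughly half of the
    records, mirroring the SAS step in the text."""
    rng = random.Random(seed)
    for items in records:  # items: list of 0/1 scored item responses
        raw_old = sum(items)
        if raw_old == 0 or rng.random() <= 0.5:
            continue  # leave zero scores and about half of the records untouched
        # Flip each correct response with probability 2 * mode_shift / raw_old,
        # so the decrement per altered record averages 2 * mode_shift.
        p_flip = min(1.0, 2.0 * mode_shift / raw_old)
        for i, resp in enumerate(items):
            if resp == 1 and rng.random() < p_flip:
                items[i] = 0

# Hypothetical check: 4,000 records, each with 30 of 40 items correct.
records = [[1] * 30 + [0] * 10 for _ in range(4000)]
simulate_mode_effect(records, mode_shift=1.0, seed=7)
mean_drop = 30 - sum(sum(r) for r in records) / len(records)  # close to 1.0
```

Because the flipping is random, rerunning this inside each bootstrap replication (as the study did) lets the sampling variability of the simulated mode effect propagate into the bootstrap standard errors.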
For each condition, we ran matched sample comparability analyses involving 100 bootstrap replications according to the steps outlined above. To summarize results within a condition, the differences in equating conversions between the paper and simulated online forms were evaluated.

Results: No Mode Effects Simulated

Figure 4 presents the differences between the online score conversions resulting from the 100 bootstrap replications and the reported paper test scale score conversions. These differences are graphed as a function of the paper form raw score. The bootstrap standard errors of the linking for the online conversions are not shown in these graphs. For mathematics, the differences ranged from about 3.5 to 4.5 scale score points across the six samples at most score points. For reading, the differences ranged from about 4.5 to 5.5 across the six samples at most score points. For both mathematics and reading, the bootstrap standard errors were higher at the extreme score points, with a pattern similar to the bootstrap standard errors from the spring 2005 mathematics and reading comparability analyses presented in Tables 2 and 3 and Figure 1.

Insert Figure 4 about here

In general, the sensitivity analysis results suggested that the MSCA method is unlikely to indicate either statistically significant or practically significant performance differences between online and paper groups in situations where no true differences exist. Moreover, the differences observed between the online and paper groups based on the matched samples analyses were not related to the overall proficiency differences between the simulated online and paper samples. Sample M2 from the mathematics simulation resulted in the largest differences between the simulated online and paper groups. For this sample, the median raw score difference over 100 replications was about However, this difference favored the online group over the paper group.
Since the online versus paper scale score differences were within two bootstrap standard errors, and there is currently no reason to hypothesize that the online group would be advantaged by mode-of-administration effects for the TAKS program, these results would not have led to any score adjustments for the online group.

Results: Mode Effects Simulated

Figures 5 to 7 present results of the sensitivity analyses when mode effects were simulated. Figure 5 presents results based on a simulated mode effect of 0.25 raw score points, Figure 6 presents results based on a simulated mode effect of 0.50 raw score points, and Figure 7 presents results based on a simulated mode effect of one raw score point.

Insert Figures 5 to 7 about here

The results presented in Figure 5 suggest that the significance criterion of two bootstrap standard errors was, for the most part, too conservative to identify a simulated mode effect of 0.25 of a raw score point. For mathematics, the results varied over the six simulation samples. For samples M1 and M4, scale score differences exceeded two bootstrap standard errors over nearly all score points, suggesting a significant mode effect that would disadvantage online students. For samples M3, M5, and M6, differences indicated that the simulated online test was more difficult, but the differences were within 10 scale score points and two bootstrap standard errors. For sample M2, no mode effects were indicated. In the case of reading, the results over the six simulated samples were more consistent. In all cases, the simulated online test was more difficult. However, scale score differences were less than 10 points across virtually all score points for all simulated reading data sets, which was within the significance criterion of two bootstrap standard errors.

The results presented in Figure 6 indicated that the matched samples comparability analyses consistently detected simulated mode effects of 0.5 raw score points. For all mathematics samples except M2 and for all reading samples, the scale score differences exceeded two bootstrap standard errors of the linkings. The average scale score differences were between 10 and 20 points for most of these samples. As would be expected, the results shown in Figure 7 indicated that the simulated mode effect of 1.0 raw score points was detected in all mathematics and reading samples. Evidence of mode effects for these data sets was unequivocal.
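The two-bootstrap-standard-error criterion used throughout these analyses can be sketched as follows: at each raw-score point, take the standard deviation of the equated scale score across replications as the bootstrap standard error of the linking, and flag the point when the mean online-minus-paper difference exceeds two such standard errors. The replication data below are synthetic and purely illustrative.

```python
import random
import statistics

def flag_mode_effects(replicated_conversions, paper_conversion, k=2.0):
    """For each raw-score point, return (mean difference, bootstrap SE,
    flagged) where the difference is the mean online-minus-paper scale score
    across replications, the SE is the SD across replications, and a point is
    flagged when |difference| > k * SE."""
    results = []
    for j, paper_ss in enumerate(paper_conversion):
        values = [conv[j] for conv in replicated_conversions]
        diff = statistics.fmean(values) - paper_ss
        se = statistics.stdev(values)  # bootstrap SE of the linking
        results.append((diff, se, abs(diff) > k * se))
    return results

# Hypothetical data: 100 replications over 3 raw-score points, with a large
# online-paper difference built in at the middle point only.
rng = random.Random(1)
paper = [500.0, 520.0, 540.0]
reps = [[500.0 + rng.gauss(0.0, 2.0),
         520.0 + rng.gauss(15.0, 2.0),
         540.0 + rng.gauss(0.5, 2.0)] for _ in range(100)]
flagged = [f for (_, _, f) in flag_mode_effects(reps, paper)]
```

Only the middle score point, where the built-in difference is far larger than the replication noise, is flagged; the small 0.5-point difference at the last point stays within two standard errors, which is exactly how a 0.25-raw-score-point mode effect can go undetected.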
Discussion of the MSCA Sensitivity Analyses

In general, the sensitivity analyses supported the MSCA approach. The method does not seem to be affected by differences in the ability of the group taking an online test versus the comparison paper test takers, at least within the range of differences studied here. One reason for this robustness might be the difference in sample sizes between the online and paper groups. In the grade 8 mathematics and reading comparability studies, the paper groups were larger than the online groups by factors of 125 and 85, respectively. It is not clear from these analyses whether the MSCA approach will work as well if the relative sample sizes of the online and paper groups become more similar. However, this seems unlikely to happen in Texas, at least in the near future.

Not surprisingly, the method did not appear to be robust in detecting a simulated mode effect of 0.25 raw score points. In part, this is a function of how conservative or liberal one is in evaluating the results. The criterion of two bootstrap standard errors of the linkings seemed, in the context of the data studied here, a somewhat conservative criterion. From looking at Figure 5, one could argue that a more liberal criterion (in the sense of being more willing to apply a separate set of score conversions for the paper group) might have led to a decision to adjust the online scores for three of the six samples for both reading and mathematics. Of course, as with any statistical analysis, the power was related to sample size. To the extent that future online comparability analyses in Texas involve increasing online sample sizes, the two bootstrap standard errors criterion will be less likely to be considered conservative. At some point, other considerations may carry more weight in evaluations, such as the magnitude of raw score-to-raw score equating differences. One finding from the sensitivity analyses that would be worth further study was the range of differences across the six simulated online data sets, particularly in the sensitivity analyses done for mathematics.
A limitation of the study was that the same six samples were used to study both the conditions where no mode effects were simulated and the conditions where various levels of mode effects were simulated (since the mode effects data sets were created by randomly changing item responses in the no mode effects data). One of the mathematics simulation samples (sample M2) was drawn by chance in such a way that the matched samples comparability analyses suggested higher performance for the online group when no mode effects were present. Sample M2 was drawn with a targeted distribution of previous scores of lower overall performance than the paper group; however, this does not seem to explain the anomalous results for sample M2. Rather, it seems that some significant sampling variation occurred in the selection of sample M2 that was related to the relationship between the previous test scores (e.g., the spring 2004 grade 7 mathematics and reading scale scores) and the criterion score (spring 2005 grade 8 mathematics raw scores). Thus, a more extensive set of simulations that incorporated the variation in sampling simulated online test takers would be helpful in assessing the extent to which this might be a concern. It might also help to inform decision rules regarding significant mode effects.

One final comment about the sensitivity analyses carried out in this study is that future online comparability studies in Texas will involve matching on a different set of criteria than those used for the spring 2005 grade 8 study. For example, an attractive alternative matching approach would be to create target frequencies based not only on previous scale scores but also on other important demographic variables such as gender, ethnicity, and English language proficiency. Because of the extremely large paper group sample sizes, a fairly refined sampling grid could be defined that incorporates most or all of these variables, although it would be necessary to group previous scale scores into intervals to prevent empty cells in the sampling grid. A limitation of the current study is that it did not examine the sensitivity of the MSCA approach to ways of matching performance between the online and paper groups other than using previously obtained scale scores. We are currently undertaking such sensitivity analyses and will use the results to inform the design of spring 2006 comparability analyses for the TAKS tests.

Conclusions

In K-12 testing, the current advantages and future promises of online testing have reached a tipping point that is encouraging virtually every state to consider or pursue online testing initiatives as part of its testing program. It is easy to envision that K-12 assessments will be administered almost exclusively online within the foreseeable future.
In the enthusiasm to embrace what Bennett (2002) refers to as the inexorable and inevitable evolution of technology and assessment, it is tempting to downplay or dismiss the comparability of online and paper assessments. Nevertheless, state testing programs and the vendors that serve them are clearly obliged to address the issue of score comparability between online and paper versions of K-12 assessments, especially given the high stakes that the results of these assessments have taken on in recent years.
The strategy Texas has adopted for introducing online testing is similar to the strategy many states are using, in which online testing is made available to those districts and schools that are willing and able to pursue it. The comparability studies presented in this paper illustrate how responsible and psychometrically defensible comparability analyses can be incorporated within the constraints of a high-stakes, operational testing program. In Texas, the MSCA approach is a central part of the strategy to offer online and paper versions of TAKS tests side-by-side as the districts and schools in the state transition to online testing. By routinely including these analyses, both when online versions of tests are introduced and as they continue to be offered, it will be possible to monitor the comparability of online and paper tests over time. Although this approach will not be without challenges, it seems an equitable and viable approach to a difficult assessment problem.
References

American Psychological Association Committee on Professional Standards and Committee on Psychological Tests and Assessments (APA). (1986). Guidelines for computer-based tests and interpretations. Washington, DC: Author.

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: AERA.

Bennett, R. E. (2002). Inexorable and inevitable: The continuing story of technology and assessment. Journal of Technology, Learning, and Assessment, 1(1).

Bergstrom, B. (1992, April). Ability measure equivalence of computer adaptive and pencil and paper tests: A research synthesis. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Bridgeman, B., Lennon, M. L., & Jackenthal, A. (2001). Effects of screen size, screen resolution, and display rate on computer-based test performance (ETS RR-01-23). Princeton, NJ: Educational Testing Service.

Choi, S. W., & Tinkler, T. (2002, April). Evaluating comparability of paper and computer-based assessment in a K-12 setting. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Dorans, N. J., & Lawrence, I. M. (1990). Checking the statistical equivalence of nearly identical test forms. Applied Measurement in Education, 3,

Glasnapp, D. R., Poggio, J., Poggio, A., & Yang, X. (2005, April). Student attitudes and perceptions regarding computerized testing and the relationship to performance in large scale assessment programs. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
More informationChapter 5 English Language Learners (ELLs) and the State of Texas Assessments of Academic Readiness (STAAR) Program
Chapter 5 English Language Learners (ELLs) and the State of Texas Assessments of Academic Readiness (STAAR) Program Demographic projections indicate that the nation s English language learner (ELL) student
More informationCollege Readiness in the US Service Area 2010 Baseline
College Readiness in the United Way Service Area 2010 Baseline Report The Institute for Urban Policy Research At The University of Texas at Dallas College Readiness in the United Way Service Area 2010
More informationAnalyzing and interpreting data Evaluation resources from Wilder Research
Wilder Research Analyzing and interpreting data Evaluation resources from Wilder Research Once data are collected, the next step is to analyze the data. A plan for analyzing your data should be developed
More informationIntroduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.
Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative
More informationMeasuring the Effectiveness of Rosetta Stone
FINAL REPORT Measuring the Effectiveness of Rosetta Stone Roumen Vesselinov, Ph.D. Visiting Assistant Professor Queens College City University of New York Roumen.Vesselinov@qc.cuny.edu (718) 997-5444 January
More informationThe Effect of Test Preparation on Student Performance
The Effect of Admissions Test Preparation: Evidence from NELS:88 Introduction For students planning to apply to a four year college, scores on standardized admissions tests--the SAT I or ACT--take on a
More informationUtilization of Response Time in Data Forensics of K-12 Computer-Based Assessment XIN LUCY LIU. Data Recognition Corporation (DRC) VINCENT PRIMOLI
Response Time in K 12 Data Forensics 1 Utilization of Response Time in Data Forensics of K-12 Computer-Based Assessment XIN LUCY LIU Data Recognition Corporation (DRC) VINCENT PRIMOLI Data Recognition
More informationGeorgia s New Tests. Language arts assessments will demonstrate: Math assessments will demonstrate: Types of assessments
Parents Guide to New Tests in Georgia In 2010, Georgia adopted the Common Core State Standards (CCSS) in English language arts and mathematics and incorporated them into the existing Georgia Performance
More informationFinal Exam Performance. 50 OLI Accel Trad Control Trad All. Figure 1. Final exam performance of accelerated OLI-Statistics compared to traditional
IN SEARCH OF THE PERFECT BLEND BETWEEN AN INSTRUCTOR AND AN ONLINE COURSE FOR TEACHING INTRODUCTORY STATISTICS Marsha Lovett, Oded Meyer and Candace Thille Carnegie Mellon University, United States of
More informationData Analysis, Statistics, and Probability
Chapter 6 Data Analysis, Statistics, and Probability Content Strand Description Questions in this content strand assessed students skills in collecting, organizing, reading, representing, and interpreting
More information2011-12 Early Mathematics Placement Tool Program Evaluation
2011-12 Early Mathematics Placement Tool Program Evaluation Mark J. Schroeder James A. Wollack Eric Tomlinson Sonya K. Sedivy UW Center for Placement Testing 1 BACKGROUND Beginning with the 2008-2009 academic
More informationA STUDY OF WHETHER HAVING A PROFESSIONAL STAFF WITH ADVANCED DEGREES INCREASES STUDENT ACHIEVEMENT MEGAN M. MOSSER. Submitted to
Advanced Degrees and Student Achievement-1 Running Head: Advanced Degrees and Student Achievement A STUDY OF WHETHER HAVING A PROFESSIONAL STAFF WITH ADVANCED DEGREES INCREASES STUDENT ACHIEVEMENT By MEGAN
More informationCRITICAL THINKING ASSESSMENT
CRITICAL THINKING ASSESSMENT REPORT Prepared by Byron Javier Assistant Dean of Research and Planning 1 P a g e Critical Thinking Assessment at MXC As part of its assessment plan, the Assessment Committee
More informationAPEX program evaluation study
APEX program evaluation study Are online courses less rigorous than in the regular classroom? Chung Pham Senior Research Fellow Aaron Diel Research Analyst Department of Accountability Research and Evaluation,
More informationWoodcock Reading Mastery Tests Revised, Normative Update (WRMT-Rnu) The normative update of the Woodcock Reading Mastery Tests Revised (Woodcock,
Woodcock Reading Mastery Tests Revised, Normative Update (WRMT-Rnu) The normative update of the Woodcock Reading Mastery Tests Revised (Woodcock, 1998) is a battery of six individually administered tests
More informationStatistics. Measurement. Scales of Measurement 7/18/2012
Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does
More informationData Coding and Entry Lessons Learned
Chapter 7 Data Coding and Entry Lessons Learned Pércsich Richárd Introduction In this chapter we give an overview of the process of coding and entry of the 1999 pilot test data for the English examination
More informationSTATISTICS FOR PSYCHOLOGISTS
STATISTICS FOR PSYCHOLOGISTS SECTION: STATISTICAL METHODS CHAPTER: REPORTING STATISTICS Abstract: This chapter describes basic rules for presenting statistical results in APA style. All rules come from
More informationNebraska School Counseling State Evaluation
Nebraska School Counseling State Evaluation John Carey and Karen Harrington Center for School Counseling Outcome Research Spring 2010 RESEARCH S c h o o l o f E d u c a t i o n U n i v e r s i t y o f
More informationCourse Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics
Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This
More informationInformation and Employee Evaluation: Evidence from a Randomized Intervention in Public Schools. Jonah E. Rockoff 1 Columbia Business School
Preliminary Draft, Please do not cite or circulate without authors permission Information and Employee Evaluation: Evidence from a Randomized Intervention in Public Schools Jonah E. Rockoff 1 Columbia
More informationAutomated Scoring for the Assessment of Common Core Standards
Automated Scoring for the Assessment of Common Core Standards David M. Williamson, Senior Research Director, Applied Research and Development, ETS Randy E. Bennett, Frederiksen Chair, Assessment Innovation,
More informationOnline Assessment and the Comparability of Score Meaning
Research Memorandum Online Assessment and the Comparability of Score Meaning Randy Elliott Bennett Research & Development November 2003 RM-03-05 Online Assessment and the Comparability of Score Meaning
More informationRecommendations for Implementation of 14-point Test Security Plan
Texas Education Agency Recommendations for Implementation of 14-point Test Security Plan Recommendation #1 TEA will analyze scrambled blocks of test questions to detect answer copying. Given the blueprint
More informationEvaluating the Comparability of Scores from Achievement Test Variations
Evaluating the Comparability of Scores from Achievement Test Variations Phoebe C. Winter, Editor Copyright 2010 by the Council of Chief State School Officers, Washington, DC All rights reserved. THE COUNCIL
More informationTEA UPDATE ON STAAR MATHEMATICS. Texas Education Agency Student Assessment Division Julie Guthrie July 2014
TEA UPDATE ON STAAR MATHEMATICS Texas Education Agency Student Assessment Division Julie Guthrie July 2014 New STAAR Mathematics Information Implementation of New STAAR Mathematics General STAAR Updates
More informationDevice Comparability of Tablets and Computers for Assessment Purposes
Device Comparability of Tablets and Computers for Assessment Purposes National Council on Measurement in Education Chicago, IL Laurie Laughlin Davis, Ph.D. Xiaojing Kong, Ph.D. Yuanyuan McBride, Ph.D.
More informationA Comparison Between Online and Face-to-Face Instruction of an Applied Curriculum in Behavioral Intervention in Autism (BIA)
A Comparison Between Online and Face-to-Face Instruction of an Applied Curriculum in Behavioral Intervention in Autism (BIA) Michelle D. Weissman ROCKMAN ET AL San Francisco, CA michellew@rockman.com Beth
More informationResults: Statewide Stakeholder Consultation on Draft Early Childhood Standards and Indicators
Results: Statewide Stakeholder Consultation on Draft Early Childhood Standards and Indicators Prepared for Minnesota Department of Education and the Minnesota Department of Human Services February 2011
More informationOBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS
OBJECTIVE ASSESSMENT OF FORECASTING ASSIGNMENTS USING SOME FUNCTION OF PREDICTION ERRORS CLARKE, Stephen R. Swinburne University of Technology Australia One way of examining forecasting methods via assignments
More informationUsing Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data
Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable
More informationAssessments in Arizona
Parents Guide to new Assessments in Arizona In June 2010, Arizona adopted the Common Core State Standards (CCSS) which were customized to meet the needs of our state and released as the Arizona Common
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationCOMMON CORE STATE STANDARDS FOR
COMMON CORE STATE STANDARDS FOR Mathematics (CCSSM) High School Statistics and Probability Mathematics High School Statistics and Probability Decisions or predictions are often based on data numbers in
More informationTest Scoring And Course Evaluation Service
Test Scoring And Course Evaluation Service TABLE OF CONTENTS Introduction... 3 Section 1: Preparing a Test or Questionnaire... 4 Obtaining the Answer Forms... 4 Planning the Test or Course evaluation...
More informationNational assessment of foreign languages in Sweden
National assessment of foreign languages in Sweden Gudrun Erickson University of Gothenburg, Sweden Gudrun.Erickson@ped.gu.se This text was originally published in 2004. Revisions and additions were made
More informationRARITAN VALLEY COMMUNITY COLLEGE ACADEMIC COURSE OUTLINE MATH 111H STATISTICS II HONORS
RARITAN VALLEY COMMUNITY COLLEGE ACADEMIC COURSE OUTLINE MATH 111H STATISTICS II HONORS I. Basic Course Information A. Course Number and Title: MATH 111H Statistics II Honors B. New or Modified Course:
More informationUnderstanding District-Determined Measures
Understanding District-Determined Measures 2013-2014 2 Table of Contents Introduction and Purpose... 5 Implementation Timeline:... 7 Identifying and Selecting District-Determined Measures... 9 Key Criteria...
More information1/27/2013. PSY 512: Advanced Statistics for Psychological and Behavioral Research 2
PSY 512: Advanced Statistics for Psychological and Behavioral Research 2 Introduce moderated multiple regression Continuous predictor continuous predictor Continuous predictor categorical predictor Understand
More informationArizona and New York Schools Push the Envelope
Arizona and New York Schools Push the Envelope Sean Brady (Prism Decision Systems, LLC) and Andrew Tait (Idea Sciences) In the United States, accountability measures from the No Child Left Behind (NCLB)
More informationSPECIFIC LEARNING DISABILITY
SPECIFIC LEARNING DISABILITY 24:05:24.01:18. Specific learning disability defined. Specific learning disability is a disorder in one or more of the basic psychological processes involved in understanding
More information