Comparing Apples with Oranges? Linking National and International Large-scale Assessments

Prof. Dr. Olaf Köller
Leibniz Institute for Science and Mathematics Education (Kiel) and Centre for International Student Assessment (Munich)
Presented at "Standard setting: International state of research and practices in the Nordic countries", Oslo, September 23, 2015

Co-operation Project
Annika Nissen, Timo Ehmke, Olaf Köller, Christoph Duchardt
Core reference: Nissen, A., Ehmke, T., Köller, O., & Duchardt, C. (2015). Comparing apples with oranges? An approach to link TIMSS and the National Educational Panel Study in Germany via equipercentile and IRT methods. Studies in Educational Evaluation, 47, 58-67. DOI 10.1016/j.stueduc.2015.07.003, ISSN 0191-491X.

Starting point: Large-scale Assessments in Germany, 2009-2016
Survey cycles marked on the timeline (the year of each cycle is not recoverable from the slide text): PIRLS 2, TIMSS 2, PISA 3, NA-PS 2, NA-SS 3, NEPS 3
PIRLS: Progress in International Reading Literacy Study
TIMSS: Trends in International Mathematics and Science Study
NA-PS: National Assessment in Primary School (Grade 4; German, Mathematics)
NA-SS: National Assessment in Secondary School (Languages vs. Math & Science)
NEPS: National Educational Panel Study; data collections in Grades 5 and 9

Samples
PIRLS: Nationally representative sample of approx. n = 4,500 students at the end of grade 4
TIMSS: Nationally representative sample of approx. n = 4,500 students at the end of grade 4
PISA: Nationally representative sample of approx. n = 5,000 15-year-old students and of approx. n = 9,000 9th graders
NA-PS: Representative samples of all 16 federal states (approx. n = 2,000 per state) at the end of grade 4
NA-SS: Representative samples of all 16 federal states (approx. n = 3,000 per state) at the end of grade 9 (approx. n = 50,000 in total)
NEPS: Nationally representative samples of students at the beginning of grade 5 (n = 7,500) and at the beginning of grade 9 (n = 15,000)

Test Designs
PIRLS: Multi-matrix design; 80 minutes testing (3PL)
TIMSS: Multi-matrix design; 80 minutes testing math and science (3PL)
PISA: Multi-matrix design; 120 minutes testing; major domain and minor domains (1PL)
NA-PS: Multi-matrix design; 80 minutes testing; parts of the sample take only mother tongue, others math, others mother tongue plus math (1PL)
NA-SS: Multi-matrix design; 120 minutes testing, 60 minutes for each domain (1PL)
NEPS: All students work on the same items; 30 minutes math, 30 minutes science, 30 minutes reading (1PL)

Research Questions
Strong political as well as scientific pressure to link studies (see, e.g., similar activities in the USA, where NAEP (grade 8) and TIMSS were linked in 2011)
Different tests, different constructs?
Different tests, different proficiency level models?
Can we use national tests to assess our students on international scales and vice versa?

Research Question I: Graphical Illustration

Research Questions II and III: Graphical Illustration
Figure: national proficiency levels (Level 1 to Level 5), based on the national student and item sample, set against international proficiency levels (Level 1 to Level 5), based on the international student and item sample.

Linking TIMSS, PIRLS, National Assessment, and National Educational Panel Study
National Assessment German/Math: 4th graders (1 class per school), 1,300 schools
TIMSS/PIRLS: 4th graders (1 class per school), 201 schools
TIMSS/NEPS/National Assessment: 4th graders (1 class per school), 80 schools

Linking TIMSS, PIRLS, National Assessment, and National Educational Panel Study
TIMSS/PIRLS: 4th graders (1 class per school), 201 schools
TIMSS/NEPS/National Assessment: 4th graders (1 class per school), 80 schools

Both studies measure mathematics competencies
Aim: Link these studies to use international benchmarks
Two linking methods are often distinguished: Classical Test Theory equating and Item Response Theory equating
Which method fits the data better?

Which linking method (CTT or IRT) should be preferred with regard to
(1) descriptive statistics?
(2) classification accuracy with respect to the TIMSS International Benchmarks?
(3) analysis of different subgroups?

Different conceptual approaches, but both aim to measure mathematics competencies: TIMSS at the end of primary school, NEPS at the beginning of grade 5

Linking TIMSS and National Educational Panel Study

Sample: 78 primary schools in Germany, 80 classes, N = 733 fourth graders (52% male, 48% female)

TIMSS 2011, Mathematics, Grade 4: 3-parameter IRT model; fixed item parameters from the international database; students' plausible values (PVs) transformed into the international TIMSS achievement scale metric
NEPS 2010, Mathematics, Grade 5: 1-parameter Rasch model; fixed item parameters taken from NEPS 2010; students' weighted likelihood estimates (WLEs) transformed into positive integers only
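For reference, the two item response functions named above can be written in their standard textbook form. The notation is an assumption and does not come from the slides: \theta_j is the ability of student j, and a_i, b_i, c_i are the discrimination, difficulty and pseudo-guessing parameters of item i. Operational scalings may additionally place a constant such as D = 1.7 in the exponent of the 3PL model.

```latex
\begin{align}
\text{3PL:}\quad
P(X_{ij}=1 \mid \theta_j) &= c_i + (1-c_i)\,
  \frac{\exp\bigl(a_i(\theta_j-b_i)\bigr)}{1+\exp\bigl(a_i(\theta_j-b_i)\bigr)} \\
\text{1PL (Rasch):}\quad
P(X_{ij}=1 \mid \theta_j) &=
  \frac{\exp(\theta_j-b_i)}{1+\exp(\theta_j-b_i)}
\end{align}
```

Setting a_i = 1 and c_i = 0 in the first line gives the second, which is why the two scalings cannot simply be assumed to share a metric.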

Model 1 (2 Dim.) versus Model 2 (1 Dim.)
Model             N     Parameters   Deviance   AIC     BIC     CAIC
NEPS-LV, 2-dim    752   238          51338      51814   52023   52261
NEPS-LV, 1-dim    752   236          51375      51847   52054   52290
1PL model; findings from ConQuest 3.0
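The information criteria in this comparison can be checked by hand. The following is a minimal sketch, assuming the usual definitions AIC = deviance + 2k (k = number of estimated parameters) and CAIC = BIC + k. The BIC values are taken as reported, and the function and dictionary names are illustrative only.

```python
def aic(deviance, n_params):
    """Akaike information criterion: deviance plus twice the parameter count."""
    return deviance + 2 * n_params


# Values as reported on the slide (ConQuest 3.0 output).
models = {
    "2-dim": {"params": 238, "deviance": 51338, "bic": 52023},
    "1-dim": {"params": 236, "deviance": 51375, "bic": 52054},
}

for name, m in models.items():
    m["aic"] = aic(m["deviance"], m["params"])
    # CAIC penalises each parameter by one unit more than BIC does.
    m["caic"] = m["bic"] + m["params"]
    print(name, "AIC:", m["aic"], "CAIC:", m["caic"])

# The model with the lower AIC is the one the criterion favours.
best = min(models, key=lambda name: models[name]["aic"])
print("Lower AIC:", best)
```

The recomputed AIC and CAIC values match the reported ones, and every criterion is lower for the two-dimensional model.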

I. Classical Test Theory Equating, e.g. Equipercentile Equating (Cartwright, 2012; Hambleton et al., 2009)
1) Determine percentile ranks for the score distributions
2) Declare scores with the same percentile rank as equivalent
→ Linking basis: score distributions
II. IRT Linking (Pietsch et al., 2009; NCES, 2013)
1) Estimate item parameters
2) Scale estimated parameters to a base IRT scale (linear transformation)
3) Transform true scores of the new test form to the true-score scale of the old form
→ Linking basis: modeling students' responses to items

I. Equipercentile Equating
(1) Find the percentile rank of each score value on the TIMSS test and the NEPS test
(2) Match scores with corresponding percentile ranks using the software LEGS 2.01 (Brennan, 2004)
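The two steps above can be sketched in a few lines. This is an illustration only, not the LEGS implementation. It assumes the mid-percentile-rank convention and linear interpolation between observed score points, and it omits any smoothing. The function names and toy data are hypothetical.

```python
import numpy as np


def percentile_ranks(scores, points):
    """Mid-percentile rank of each value in points: 100 * (below + half of ties) / n."""
    ordered = np.sort(np.asarray(scores, dtype=float))
    below = np.searchsorted(ordered, points, side="left")
    at_or_below = np.searchsorted(ordered, points, side="right")
    return 100.0 * (below + 0.5 * (at_or_below - below)) / ordered.size


def equipercentile_equate(x_scores, y_scores):
    """Map each observed X score to the Y score with the same percentile rank."""
    x_points = np.unique(x_scores)
    y_points = np.unique(y_scores)
    p_x = percentile_ranks(x_scores, x_points)  # step 1 for test X
    q_y = percentile_ranks(y_scores, y_points)  # step 1 for test Y
    # Step 2: interpolate the Y score at the percentile rank of each X score.
    # Ranks outside the observed Y range are clamped to the lowest/highest Y score.
    y_equivalents = np.interp(p_x, q_y, y_points)
    return x_points, y_equivalents


# Toy data (hypothetical): two tests taken by the same small group.
x_toy = [1, 2, 3, 4]
y_toy = [10, 20, 30, 40]
x_pts, y_eq = equipercentile_equate(x_toy, y_toy)
print(dict(zip(x_pts.tolist(), y_eq.tolist())))
```

With real data the conversion is a lookup table from each NEPS score to a TIMSS-scale equivalent. It is usually less stable in the sparse tails of the distributions than in the middle.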

Results Equipercentile Equating

NEPS score distribution
x      freq   cum freq   f(x)   F(x)   P(x)
0      3      3          0.00   0.00   0.21
29     2      5          0.00   0.01   0.55
53     7      12         0.01   0.02   1.16
74     8      20         0.01   0.03   2.18
93     7      27         0.01   0.04   3.21
111    6      33         0.01   0.05   4.09
127    10     43         0.01   0.06   5.18
142    15     58         0.02   0.08   6.89
...
478    5      678        0.01   0.92   92.16
493    18     696        0.02   0.95   93.72
510    6      702        0.01   0.96   95.36
528    12     714        0.02   0.97   96.59
570    13     727        0.02   0.99   98.30
597    1      728        0.00   0.99   99.25
631    2      730        0.00   1.00   99.45
754    3      733        0.00   1.00   99.80

TIMSS score distribution
y      freq   cum freq   g(y)   G(y)   Q(y)
355    1      1          0.00   0.00   0.07
375    2      3          0.00   0.00   0.27
380    1      4          0.00   0.01   0.48
385    1      5          0.00   0.01   0.62
395    3      8          0.00   0.01   0.89
405    2      10         0.00   0.01   1.23
410    2      12         0.00   0.02   1.50
415    3      15         0.00   0.02   1.84
...
690    3      720        0.00   0.98   98.16
695    2      722        0.00   0.99   98.50
700    2      724        0.00   0.99   98.77
705    1      725        0.00   0.99   98.98
710    2      727        0.00   0.99   99.18
715    2      729        0.00   1.00   99.45
720    2      731        0.00   1.00   99.73
735    1      732        0.00   1.00   99.93

Results Equipercentile Equating
Figure: TIMSS scores (y-axis, 350 to 750) plotted against NEPS scores (x-axis, 0 to 800).

IRT Linking
(1) Scale TIMSS and NEPS data simultaneously in a single IRT model, fixing the TIMSS item parameters at their international-scale values
(2) Calibrate NEPS data with the item parameters from the common scaling
(3) Convert students' scores into score equivalents on the TIMSS international scale
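Step (3) can be illustrated once item difficulties sit on the common scale. The sketch below is an assumption-laden simplification, not the study's procedure. It uses a plain maximum-likelihood ability estimate under the Rasch model (the study reports WLEs and plausible values) and a linear conversion to a reporting metric. The item difficulties, the response pattern and the conversion constants are all hypothetical.

```python
import math


def rasch_prob(theta, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def ml_theta(responses, difficulties, iterations=25):
    """Newton-Raphson maximum-likelihood ability estimate for one student."""
    raw = sum(responses)
    if raw == 0 or raw == len(responses):
        raise ValueError("ML estimate is not finite for zero or perfect scores")
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_prob(theta, b) for b in difficulties]
        gradient = raw - sum(probs)                     # first derivative of log-likelihood
        information = sum(p * (1 - p) for p in probs)   # negative second derivative
        theta += gradient / information
    return theta


def to_reporting_scale(theta, slope, intercept):
    """Linear conversion from the logit metric to a reporting metric."""
    return slope * theta + intercept


# Hypothetical difficulties, assumed to be already expressed on the common scale.
common_scale_difficulties = [-1.0, 0.0, 1.0]
student_responses = [1, 1, 0]

theta_hat = ml_theta(student_responses, common_scale_difficulties)
# Slope and intercept are placeholders, not the TIMSS transformation constants.
print(round(to_reporting_scale(theta_hat, slope=100.0, intercept=500.0), 1))
```

The point of the fixed-parameter design is that the second function needs no test-specific adjustment: once the NEPS items are calibrated against the fixed TIMSS parameters, any ability estimate is already in the TIMSS logit metric.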

Results Equating

Descriptive statistics for NEPS and TIMSS
                          Mean   SD    Skew   Kurt
NEPS                      307    118   0.27   3.35
TIMSS                     545    64    0.07   3.06
Equipercentile Equating   545    63    0.06   2.99
IRT Equating              545    72    0.07   2.79

Classification of students to TIMSS 2011 International Benchmarks in Mathematics
                          < low   low     intermediate   high    advanced   Sum      Cohen's Kappa
TIMSS                     1.2%    13.1%   38.6%          37.7%   9.4%       100.0%
Equipercentile Equating   0.7%    14.3%   39.6%          37.2%   8.2%       100.0%   0.384
IRT Equating              1.6%    13.4%   39.6%          33.7%   11.7%      100.0%   0.371
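The classification comparison above rests on two operations: assigning each score to a benchmark band and measuring chance-corrected agreement between two assignments. A minimal sketch follows. It assumes the published TIMSS International Benchmark cut scores (400, 475, 550, 625), which are not stated on the slide, and uses hypothetical function names and toy scores.

```python
from bisect import bisect_right
from collections import Counter

# Assumed cut scores for low, intermediate, high and advanced benchmarks.
CUT_SCORES = [400, 475, 550, 625]
LABELS = ["< low", "low", "intermediate", "high", "advanced"]


def benchmark(score):
    """Return the benchmark band reached by a score (a score at a cut reaches that band)."""
    return LABELS[bisect_right(CUT_SCORES, score)]


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    return (observed - expected) / (1 - expected)


# Toy scores (hypothetical): original TIMSS scores and linked equivalents.
timss_scores = [390, 480, 560, 640]
linked_scores = [410, 470, 565, 630]

timss_bands = [benchmark(s) for s in timss_scores]
linked_bands = [benchmark(s) for s in linked_scores]
print(timss_bands, linked_bands, cohen_kappa(timss_bands, linked_bands))
```

Because kappa is computed on individual classifications, two methods can reproduce the marginal percentages closely and still show only moderate kappa values.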

Results Equating
Cross-classification of benchmark levels, IRT Linking by Equipercentile Equating (% of students)
Level   1     2     3      4      5      Sum
1       0.4   0.7   0.0    0.0    0.0    1.1
2       0.0   6.1   3.7    0.2    0.0    9.9
3       0.0   0.0   25.1   8.4    0.1    33.6
4       0.0   0.0   0.0    37.8   2.6    40.4
5       0.0   0.0   0.0    1.9    13.1   15.0
Sum     0.4   6.8   28.8   48.3   15.8   100.0
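The cross-classification can be summarised as the share of students placed on the same level by both methods, or on levels at most one apart. The sketch below is a worked calculation on the cell percentages shown above. It does not depend on which method is on which axis, because both measures are symmetric. The function name is illustrative.

```python
# Cell percentages as shown in the table; rows and columns are levels 1 to 5.
table = [
    [0.4, 0.7, 0.0, 0.0, 0.0],
    [0.0, 6.1, 3.7, 0.2, 0.0],
    [0.0, 0.0, 25.1, 8.4, 0.1],
    [0.0, 0.0, 0.0, 37.8, 2.6],
    [0.0, 0.0, 0.0, 1.9, 13.1],
]


def share_within(cells, max_distance):
    """Share of all cases whose two level assignments differ by at most max_distance."""
    total = sum(sum(row) for row in cells)  # cells sum to slightly over 100 due to rounding
    hits = sum(
        cell
        for i, row in enumerate(cells)
        for j, cell in enumerate(row)
        if abs(i - j) <= max_distance
    )
    return hits / total


exact = share_within(table, 0)     # same level under both methods
adjacent = share_within(table, 1)  # same level or one level apart
print(f"exact agreement: {exact:.3f}, within one level: {adjacent:.3f}")
```

The diagonal cells sum to 82.5, so a little over four in five students receive the same level under both methods, and nearly all remaining cases differ by a single level.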

Discussion
Application of linking NEPS and TIMSS:
Criterion-based interpretation of the NEPS mathematics scores
Basis for longitudinal studies on students who fail the lowest (or reach the highest) educational standards
In the validation study
a) both methods lead to
→ the same estimates of population means
→ satisfactory classification accuracy for proficiency levels
→ similar skewness and kurtosis
b) equipercentile methods should be preferred regarding
→ estimation of standard deviations

Thank you very much for your attention! Contact: koeller@ipn.uni-kiel.de