The effects of value-added modeling decisions on estimates of teacher effectiveness
University of Iowa
Iowa Research Online
Theses and Dissertations
2014

The effects of value-added modeling decisions on estimates of teacher effectiveness

Paula Lynn Cunningham
University of Iowa

Copyright 2014 Paula Lynn Cunningham

This dissertation is available at Iowa Research Online.

Recommended Citation: Cunningham, Paula Lynn. "The effects of value-added modeling decisions on estimates of teacher effectiveness." PhD (Doctor of Philosophy) thesis, University of Iowa, 2014.
THE EFFECTS OF VALUE-ADDED MODELING DECISIONS ON ESTIMATES OF TEACHER EFFECTIVENESS

by

Paula Lynn Cunningham

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) in the Graduate College of The University of Iowa

December 2014

Thesis Supervisor: Professor Catherine J. Welch
Copyright by PAULA LYNN CUNNINGHAM 2014. All Rights Reserved.
Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Paula Lynn Cunningham has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Psychological and Quantitative Foundations (Educational Measurement and Statistics) at the December 2014 graduation.

Thesis Committee: Catherine J. Welch, Thesis Supervisor; Robert D. Ankenmann; Timothy N. Ansley; Stephen B. Dunbar; Marcus J. Haack; David B. Bills
To the memory of Bernice Rita Shanklin
Never despair, but if you do, work on in despair.

Fortune Cookie
Chuong Garden Restaurant
Grinnell, Iowa
ACKNOWLEDGMENTS

I wish to express my sincere gratitude to my advisor, Cathy Welch, for her guidance, support, and understanding, both during the dissertation research and through all my years as a graduate student. A knowledgeable and patient mentor, she has helped me to become an independent researcher. Most of all, I am thankful for her kindness and reassurance when I needed to pause my graduate study and her encouragement when I was able to resume it. I also wish to acknowledge the generous input of Steve Dunbar to the success of this research project from its beginning. In addition, I thank all the members of my dissertation committee for their insightful feedback in the form of suggestions aimed at strengthening this document, helping to make it an accomplishment of which I can truly be proud. I also wish to thank all the talented people of the Iowa Testing Programs, and Matt Whittaker in particular for his efforts in creating the matched longitudinal data sets and generating the state means results used in this study. It was as a graduate research assistant for Iowa Testing Programs that I learned how test development and psychometric research are accomplished. Through assignments that challenged me and the realization that my contributions mattered, I grew more confident as I gained understanding. I feel grateful for having had the privilege of working among these dedicated professionals. Not least of all I acknowledge the source of my strength to continue in this enterprise, the solid foundation keeping me upright: my family. I cannot praise too highly my husband Charles and son Evan for their encouragement at the outset of this journey and their support through its completion. From the shaky first semester to comprehensive examinations and the dissertation phase they have been there for me, sharing all the trials, successes, despair, and joy (in short, life) that happened along the way.
ABSTRACT

This study was undertaken to evaluate the impact of modeling decisions made by those charged with implementing teacher evaluation systems that incorporate student achievement data; such choices include how growth is to be modeled, whether student characteristics are to be controlled for, how many years of data are to be used, and which test subject is to be selected. Using a three-cohort longitudinal data set from a school district in which reading and mathematics test scores from a vertically-scaled assessment allowed determination of growth in grades three, four, and five, estimated teacher effects were derived from five value-added models, and the resulting rank orderings of the teachers were examined. The models compared were a covariate adjustment model that conditioned on prior achievement only, a covariate adjustment model that conditioned on certain student characteristics as well as prior achievement, a gain score model, the growth model underlying the vertically-scaled assessment, and student growth percentiles. Teacher rank orderings derived under the five models were highly consistent with one another using either one or three classroom years of test scores. Only when the movement of teachers between quartiles was examined did a difference in performance between some models emerge. The high degree of consistency between the two covariate adjustment models suggested that control for student-level characteristics was unnecessary. Using three years of test scores rather than one led to a small decrease in between-model correlations and a small increase in teacher movement between quartiles. Comparison of teacher value-added based on reading scores versus mathematics scores gave mixed results, with between-model correlations in mathematics being slightly higher than those for reading but with reading showing greater consistency in quartile movement between cohorts.
The year-to-year change in teacher rank orderings was very striking, as low, and even negative, correlations emerged between years. Movement of teachers between quartiles from one year to the next was far greater than that observed when comparing the modeling conditions. Using a teacher rating scheme in which groups of teachers were distinguished from average effectiveness if they appeared in the extremes of the rankings, nearly half of teachers changed ratings from one year to the next. Such low intertemporal stability of teacher value-added is a significant result that should be considered by all stakeholders in teacher evaluation.
PUBLIC ABSTRACT

This study examined the impact of modeling decisions made in implementing value-added teacher evaluation; such choices include the growth model itself, whether to control for student characteristics, how many years of scores to use, and the subject tested. Estimates of teacher effectiveness were derived from five models, which were a covariate adjustment model that conditioned on prior achievement only, a covariate adjustment model that conditioned on certain student characteristics as well as prior achievement, a gain score model, the growth model underlying the assessment, and student growth percentiles. The resulting rank orderings of the teachers were examined and found to be highly consistent with one another using scores for either one or three classroom years. When the movement of teachers between quartiles of the rank orderings was examined, a difference in performance between some models did emerge. The covariate adjustment models were highly consistent, suggesting that control for student-level characteristics was unnecessary. Using three years of data rather than one did not significantly change model performance, and comparison of rank orderings based on reading scores versus mathematics scores gave mixed results. The year-to-year inconsistency in rank orderings was striking. Movement of teachers between quartiles from one year to the next was far greater than that observed when comparing modeling conditions. Under a rating scheme in which teachers were distinguished from average effectiveness if they appeared in the extremes of the rankings, nearly half of teachers changed ratings from one year to the next.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER I INTRODUCTION
    An Approach to Teacher Evaluation
    Implementing VAM-based Teacher Evaluation
    Purpose of the Study and Research Questions

CHAPTER II LITERATURE REVIEW
    Status versus Growth
    Growth Models
    Growth Models versus Value-Added Models
    Four Widely Used Models
        Gain Score Model
        Residual Gain/Covariate Adjustment Model
        Student Growth Percentile Model
        Educational Value-Added Assessment System
    Research on Comparison of Models
    Ongoing Concerns about Value-Added Models
        Bias
        Precision
        Stability
    Practical Considerations

CHAPTER III METHODS
    Data
    Value-Added Models
        Covariate Adjustment Model 1 (CA1)
        Covariate Adjustment Model 2 (CA2)
        Gain Score Model (GAIN)
        Iowa Growth Model (IOWA)
        Student Growth Percentile Model (SGP)
    The Study and Research Questions
        Section 1: Question 1a
        Section 2: Question 1b
        Section 3: Question 2
        Section 4: Question 3

CHAPTER IV RESULTS
    Section 1: Effect of Model Choice with Single Cohorts
        Spearman Rank Order Correlations
        Quartile Analysis
    Section 2: Effect of Model Choice with Multiple Cohorts
        Spearman Rank Order Correlations
        Quartile Analysis
    Section 3: Stability between Cohorts
        Teacher Retention between Cohorts
        Between-cohort Spearman Rank Order Correlations
        Quartile Analysis
        Rating Consistency
    Section 4: Generalizability across Tests
        Effect of Model Choice with Single Cohorts
        Effect of Model Choice with Multiple Cohorts
        Stability between Cohorts
        Between-subject Spearman Rank Order Correlations
    Summary of Results

CHAPTER V DISCUSSION
    Summary of Findings
        Research Question 1
        Research Question 2
        Research Question 3
    Implications for Practice
    Limitations and Continuing Research
    Conclusion

APPENDIX: CATERPILLAR PLOTS OF TEACHER VALUE-ADDED

REFERENCES
LIST OF TABLES

Table 3.1 Group Means with Standard Deviations on the Reading Subtest for All Cohorts and Grades
Table 3.2 Group Means with Standard Deviations on the Mathematics Subtest for All Cohorts and Grades
Table 3.3 Percentages of Students with Positive Status on FRL, IEP, ELL, and Combinations Thereof
Table 3.4 Correlations between Reading Subtest Score, Mathematics Subtest Score, FRL, IEP, and ELL Variables
Table 3.5 R² Values for Best Predictive Models
Table 4.1 Pooled Spearman Rank Order Correlations between Models for Single-year Analysis
Table 4.2 Transition Matrices Showing Quartile Consistency between Models for Single-year Analysis
Table 4.3 Percent of Teachers who Changed Quartile by Model for Single-year Analysis
Table 4.4 Pooled Spearman Rank Order Correlations between Models for Multiple-year Analysis
Table 4.5 Transition Matrices Showing Quartile Consistency between Models for Multiple-year Analysis
Table 4.6 Percent of Teachers who Changed Quartile by Model for Multiple-year Analysis
Table 4.7 Percent Teacher Retention between Cohorts
Table 4.8 Pooled Spearman Rank Order Correlations between Cohorts
Table 4.9 Median Spearman Rank Order Correlations between Cohorts
Table 4.10 Transition Matrices Showing Year-to-year Consistency of Quartiles
Table 4.11 Percent of Teachers who Changed Quartile Year-to-year
Table 4.12 Percent of Teachers who Changed Rating Year-to-year
Table 4.13 Spearman Correlations between Models Pooled by Subject for Single-year Analysis
Table 4.14 Transition Matrices Showing Quartile Consistency between Models for Single-year Analysis for the Reading Subtest
Table 4.15 Transition Matrices Showing Quartile Consistency between Models for Single-year Analysis for the Mathematics Subtest
Table 4.16 Percent of Teachers who Changed Quartile Due to Model by Subtest for Single-year Analysis
Table 4.17 Spearman Correlations between Models Pooled by Subject for Multiple-year Analysis
Table 4.18 Transition Matrices Showing Quartile Consistency between Models for Multiple-year Analysis for the Reading Subtest
Table 4.19 Transition Matrices Showing Quartile Consistency between Models for Multiple-year Analysis for the Mathematics Subtest
Table 4.20 Percent of Teachers who Changed Quartile Due to Model by Subtest for Multiple-year Analysis
Table 4.21 Pooled Spearman Rank Order Correlations between Cohorts by Subtest
Table 4.22 Median Spearman Rank Order Correlations between Cohorts by Subtest
Table 4.23 Transition Matrices Showing Year-to-year Consistency in Quartiles for Reading Subtest
Table 4.24 Transition Matrices Showing Year-to-year Consistency in Quartiles for Mathematics Subtest
Table 4.25 Percent of Teachers who Changed Quartile Year-to-year by Subject
Table 4.26 Percent of Teachers who Changed Rating Year-to-year by Subject
Table 4.27 Between-subject Spearman Correlations Pooled over Methods
Table 5.1 Additional Test Items Answered Correctly by the Class of the Highest-ranked Teacher Compared to the Class of the Lowest-ranked Teacher
LIST OF FIGURES

Figure 2.1 Illustration of the Gain Score Model
Figure 2.2 Illustration of the Residual Gain Model
Figure 2.3 Illustration of a Linear Regression Line and a Median Quantile Regression Line
Figure 3.1 Structure of the Longitudinal Data Sets
Figure 3.2 Attribution of Growth Using Fall-to-fall Testing Schedule
Figure 3.3 The Iowa Growth Model: Plots Demonstrating the Relationship between Standard Score and Percentile Rank for Levels of the Reading Subtest of the Iowa Assessments
Figure 3.4 The Eighteen Rank Orderings Generated under Each VAM Condition with Single-year Data
Figure 3.5 The Six Rank Orderings Generated under Each VAM Condition with Multiple-year Data
Figure 4.1 Rank Ordering Change from Cohort 1 to Cohort 2 for Fourth Grade Mathematics Using the Gain Score Model
Figure 4.2 Rank Ordering Change from Cohort 1 to Cohort 2 for Fourth Grade Reading Using the Gain Score Model
Figure A1 Caterpillar Plots for Cohort 1 under the CA1 Model
Figure A2 Caterpillar Plots for Cohort 1 under the CA2 Model
Figure A3 Caterpillar Plots for Cohort 1 under the GAIN Model
Figure A4 Caterpillar Plots for Cohort 1 under the IOWA Model
Figure A5 Caterpillar Plots for Cohort 1 under the SGP Model
Figure A6 Caterpillar Plots for Cohort 2 under the CA1 Model
Figure A7 Caterpillar Plots for Cohort 2 under the CA2 Model
Figure A8 Caterpillar Plots for Cohort 2 under the GAIN Model
Figure A9 Caterpillar Plots for Cohort 2 under the IOWA Model
Figure A10 Caterpillar Plots for Cohort 2 under the SGP Model
Figure A11 Caterpillar Plots for Cohort 3 under the CA1 Model
Figure A12 Caterpillar Plots for Cohort 3 under the CA2 Model
Figure A13 Caterpillar Plots for Cohort 3 under the GAIN Model
Figure A14 Caterpillar Plots for Cohort 3 under the IOWA Model
Figure A15 Caterpillar Plots for Cohort 3 under the SGP Model
Figure A16 Caterpillar Plots for Combined Cohorts under the CA1 Model
Figure A17 Caterpillar Plots for Combined Cohorts under the CA2 Model
Figure A18 Caterpillar Plots for Combined Cohorts under the GAIN Model
Figure A19 Caterpillar Plots for Combined Cohorts under the IOWA Model
Figure A20 Caterpillar Plots for Combined Cohorts under the SGP Model
CHAPTER I INTRODUCTION

Accountability in K-12 education is an ongoing concern. The most recent reauthorization of the Elementary and Secondary Education Act (ESEA), the No Child Left Behind Act of 2001 (NCLB), mandated testing of students to hold schools and districts accountable for making Adequate Yearly Progress (AYP) toward 100 percent proficiency in reading and mathematics by 2014 or face sanctions. A few years later, the Secretary of Education announced the Growth Model Pilot Program (GMPP; Spellings, 2005); many states subsequently moved away from using the status measure of proficiency toward another measure, growth to a standard, in the belief that this measure could allow some schools to make AYP that would fail to do so under the status measure. Over time, growth models have become the preferred method of analyzing student achievement test data for the purpose of accountability (Betebenner & Linn, 2010). In 2009, as part of the American Recovery and Reinvestment Act, the Race to the Top (RTTT) initiative placed emphasis on teacher evaluation using student test scores (United States Department of Education, 2009). Value-added modeling, in which student achievement is attributed to various causes, such as teachers, schools, and sometimes background characteristics, is the most recent tool being brought to bear on the question of accountability. With many states choosing to emphasize teacher evaluation, and with their students' longitudinal data having been recorded over years of standardized testing, value-added modeling is now receiving a great deal of attention.

An Approach to Teacher Evaluation

Numerous states are implementing evaluation systems that incorporate students' standardized test scores to some degree in consequential decisions about teacher salaries, promotions, tenure, and even dismissal (Braun, 2005). Value-added models (VAMs) are
used to quantify deviations from expected student performance on a test after a year of instruction, based on characteristics such as the student's achievement on the previous year's test. Teachers in elementary grades whose students take standardized tests in subjects such as reading and mathematics can be held accountable for getting them to achieve their expected scores. The movement toward linking student performance on tests to teacher evaluations gained considerable momentum through the awarding of points in the Race to the Top initiative to states that did link them (Braun, 2012). Many proponents take the view that VAMs hold the promise of adding objectivity to teacher evaluation systems that have heretofore relied on seniority, attainment of credentials, and principal observations of classroom performance (Braun, 2012). They might suggest that the first two measures do not really reflect teacher effectiveness in the classroom and that principal observations occur too infrequently and result in satisfactory ratings for virtually all teachers, making them less useful as a measure to distinguish between teachers (Papay, 2012). In addition, some VAMs purport to control for student background characteristics; this has been interpreted as meaning that VAMs level the playing field, so that teachers are evaluated more fairly. Yet VAM-derived teacher effects are themselves known to contain considerable error, particularly when they result from fewer than three years of accumulated test data. They are also subject to unpredictable bias introduced either because they do or do not attempt to account for student background characteristics (McCaffrey, Lockwood, Koretz, & Hamilton, 2003). When such statistical controls are introduced, there is a further concern that they result in different achievement expectations for different groups of students (Ballou, Sanders, & Wright, 2004).
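To make the residual-based logic concrete, the sketch below (synthetic data and invented parameter values, not the models or data analyzed in this study) regresses current-year scores on prior-year scores and averages each teacher's students' deviations from expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 students taught by 10 teachers.
n_students, n_teachers = 200, 10
teacher = rng.integers(0, n_teachers, n_students)   # classroom assignment
teacher_effect = rng.normal(0, 3, n_teachers)       # hidden "true" effects
prior = rng.normal(200, 15, n_students)             # prior-year scale scores
current = 25 + 0.9 * prior + teacher_effect[teacher] + rng.normal(0, 8, n_students)

# Expected current score given prior achievement: simple linear regression.
slope, intercept = np.polyfit(prior, current, 1)
expected = intercept + slope * prior

# A student's deviation from expectation, averaged over a teacher's class,
# is that teacher's value-added estimate under this simple model.
value_added = {t: float((current - expected)[teacher == t].mean())
               for t in range(n_teachers)}
```

This is the skeleton of a covariate adjustment approach; operational VAMs layer additional covariates, multiple years, and shrinkage on top of it.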
Another consideration is striking a balance between complexity and transparency: VAMs applied in educational settings can be very complex and involve numerous factors, so that explaining to teachers how they work and how their rankings are generated is not simple (National Research Council & National Academy of Education, 2010).
Implementing VAM-based Teacher Evaluation

Despite the enthusiasm with which some state legislatures are mandating new teacher evaluation systems that incorporate student test scores, no clear set of best practices exists to guide those charged with implementing them. There are numerous requirements and consequential decisions facing state departments of education and individual school districts during the process of implementing teacher evaluation systems that rely on the use of VAMs. Adopting such models for teacher evaluation places many requirements on states and school districts for their proper use. The most obvious requirement is the existence of matched longitudinal test score data for students; depending on the model chosen, additional student-level demographic data may be required as well. Within these data, accurate links to classroom teachers must exist, or else the student data cannot be included in the analysis and will effectively be treated as missing. The problem of missing test scores must be handled either by deletion of cases or imputation of values, with consequences arising from either choice (Cunningham, Welch, & Dunbar, 2014). Experts are required both to conduct the analysis using VAMs and to produce reports and lead training sessions that support administrators and educators in making appropriate inferences from the analysis. Furthermore, an evaluation of the system must be established in order to monitor the effects of its implementation on students and teachers alike, with sensitivity to unintended consequences. Among the decisions state departments of education and school districts may have some input into are the uses to which these analyses may be put and whether the stakes for educators are high or low.
While researchers generally agree that the use of student achievement test data to evaluate teachers for low-stakes purposes, such as establishing which teachers may benefit most from improvement strategies through professional development, is a warranted use of VAMs, there is far less agreement about the extent to which VAMs should be relied upon in mandated evaluation for high-stakes
purposes, such as merit pay or tenure (National Research Council & National Academy of Education, 2010). The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) make clear that there should be evidence of validity and reliability for every test use and that the greater the consequences of the test use, the stronger the evidence in support of that use should be. States and school districts need to consider that the researchers who understand and use VAMs the most do not agree that high-stakes teacher evaluation is an appropriate use of the technique. When the use of student achievement data for teacher evaluation has been mandated, a decision must be made about whether the VAM-derived teacher effects will replace or complement other measures of teacher effectiveness already in use. It should be considered whether the use of VAMs results in more useful, accurate, and fair outcomes than other measures. All such measures are imperfect, but as part of a teacher evaluation system using multiple measures, such as standardized principal evaluations that include classroom visits and video recordings, tests of teachers' content knowledge, surveys of students and parents, and teacher peer evaluations, some concerns expressed by researchers may be allayed (Kane & Staiger, 2012). States and school districts must still determine how to weigh the VAM-derived teacher effects against those other measures. Finally, numerous choices need to be made concerning the value-added modeling itself. There are many different types of VAMs discussed in the literature, yet at this time no method has emerged as dominant (National Research Council & National Academy of Education, 2010).
Some factors that are considered by VAM researchers include whether to specify teacher effects as fixed or random, whether to take a univariate or multivariate approach to modeling, how to disentangle school effects from teacher effects, and how to handle incomplete student records (McCaffrey et al., 2003).
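One of these choices, fixed versus random teacher effects, can be illustrated with a small sketch (invented variance components, not values estimated in this study): a fixed-effects style estimate is the raw class mean of residualized scores, while a random-effects style estimate shrinks that mean toward zero, more strongly for small classes:

```python
import numpy as np

rng = np.random.default_rng(1)

sigma2_e = 8.0 ** 2   # assumed within-classroom variance of residualized scores
tau2 = 3.0 ** 2       # assumed between-teacher variance

def shrunken_mean(scores):
    """Empirical-Bayes style estimate: the raw class mean multiplied by a
    reliability weight in [0, 1) that approaches 1 as class size grows."""
    n = len(scores)
    weight = tau2 / (tau2 + sigma2_e / n)
    return weight * float(np.mean(scores))

# Two teachers with the same underlying effect (+4) but different class sizes.
small_class = rng.normal(4.0, 8.0, 5)
large_class = rng.normal(4.0, 8.0, 30)

fixed_small, random_small = float(np.mean(small_class)), shrunken_mean(small_class)
fixed_large, random_large = float(np.mean(large_class)), shrunken_mean(large_class)
# The small class's estimate is pulled toward zero much more heavily.
```

The shrinkage weight is the standard random-effects reliability ratio; its dependence on class size is one reason the fixed-versus-random choice matters for teachers with few students.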
States and school districts, as users of VAM results, would likely have to depend upon their experts for advice about the impact of these decisions upon the analyses they conduct. However, these users can and arguably should have input on certain aspects of the modeling so that they have ownership of the process and remain accountable to their stakeholders. State departments of education and individual school districts could be involved in decisions about how to characterize student growth in achievement, how many years of data to use for evaluating teachers, and which, if any, student characteristics to control for in the analysis (Raudenbush, 2004). There are many metrics available to characterize the student growth modeled by VAMs, and the preferred growth metric will depend on factors such as the type of assessments available and the ease with which student growth can be understood by policymakers and practitioners. One metric that has seen much use in VAMs is residual gain, which is a measure of how much a student's score deviates from the regression of current scores on past scores; a VAM that uses this method to characterize growth is called a covariate adjustment model. Another metric used in VAMs is the gain score, which is literally the difference between one year's achievement and the prior year's achievement on the score scale. While there is no single preferred model for value-added analysis, these are among the more commonly used choices (McCaffrey et al., 2003). There are, however, additional growth metrics that could find application in VAMs. One consideration is that expected annual growth on an assessment, conditional on prior achievement, can be predicted by projecting forward a year through its vertical scale, which is established on a growth model (Furgol, Fina, & Welch, 2011). Another growth metric that could be utilized to calculate estimates of teacher value-added is the student growth percentile (SGP; Betebenner, 2009).
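A toy comparison of the gain score and residual gain metrics (three invented score pairs, not data from this study) shows that the two need not rank students the same way:

```python
import numpy as np

# Three students' scores on a vertical scale: prior year and current year.
prior = np.array([180.0, 200.0, 220.0])
current = np.array([195.0, 212.0, 228.0])

# Gain score: the literal difference on the score scale.
gain = current - prior                                 # [15, 12, 8]

# Residual gain: deviation from the regression of current scores on prior scores.
slope, intercept = np.polyfit(prior, current, 1)
residual_gain = current - (intercept + slope * prior)

# The largest raw gain belongs to the lowest-scoring student, but the largest
# residual gain belongs to the middle student, who most exceeded expectation.
```

Here the gain score rewards the student who moved the most scale points, while the residual gain rewards the student who did best relative to what prior achievement predicted.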
The SGP metric relies on quantile regression, conditioning on prior achievement to describe the current achievement of students. Because the first of these metrics depends on using a vertically-scaled assessment whereas the second does not, the types of assessments available to the
state or school district may dictate which of these is preferred. Furthermore, a particular growth metric may come to be seen as more acceptable by practitioners, particularly if its details can be communicated thoroughly enough to be accurate yet transparently enough to be understandable. State departments of education and individual school districts adopting VAM-based evaluation systems need to decide over how many years of instruction teachers will be evaluated. One of the major hurdles in applying VAMs to teacher evaluation is that teachers, especially in the elementary grades, often have very small classes. While more years teaching in the district will increase the amount of student data available to evaluate the teacher, and perhaps thereby lower the standard errors of teacher effects, the solution is not simply to use seven years of data and assume that the errors of the estimates will improve substantially. After all, not all teachers will have been teaching for that many years in a district, so there will always be many teachers who have few students and, as a result, estimated teacher effects with larger standard errors. Furthermore, there is the question of whether it is appropriate to use seven-year-old data for current teacher evaluations; that question would have to be taken up by those who set policy. The use of student and sometimes teacher characteristics to adjust expected student growth is controversial, with many value-added researchers embracing the idea because the practice may result in greater stability of the estimated teacher effects. It is also purported to correct for influences on student achievement from outside the school environment, so that teachers are evaluated fairly regardless of the composition of their classrooms.
However, it is not uncommon for those who make decisions for states and school districts to be more reluctant to include demographic covariates, in order to avoid the appearance of adopting different expectations for different groups of students. While research on the effect of including such covariates is somewhat mixed, it is clear that prior achievement is the single most important one, accounting for much more variance
in the estimates than demographic covariates do (Ballou et al., 2004; Lockwood et al., 2007). Statistical control for student-level characteristics is easily implemented as part of a covariate adjustment model.

Purpose of the Study and Research Questions

In order to provide guidance to policymakers and practitioners making decisions about teacher evaluation systems that incorporate student achievement data, a study was undertaken to evaluate the impact of choices concerning how student growth is to be modeled, how many years of data are to be used, and whether student characteristics are to be controlled for in the analysis. The study used a three-cohort longitudinal data set from a school district in which reading and mathematics test scores from a vertically-scaled assessment were available for four consecutive years in each cohort, such that growth could be assessed in the third, fourth, and fifth grades. Estimated teacher effects were derived from VAMs using five different metrics for growth, and the resulting rank orderings of the teachers were examined. Research questions for the study included:

1. How do the rank orderings derived from different metrics for growth compare with one another for both (a) single-year and (b) multiple-year analyses?
2. How do the rank orderings derived using the various growth metrics compare year-to-year between the cohorts?
3. How generalizable are the answers to questions 1 and 2 above from one test subject to another?

These three research questions address various aspects of the application of VAMs to a practical setting. The methods used to address each research question are described in Chapter III.
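The comparison machinery behind these questions can be sketched in miniature; the teacher value-added numbers below are hypothetical, and the functions are minimal stand-ins (no tie handling) for the Spearman correlations and quartile transition matrices reported later:

```python
import numpy as np

def ranks(x):
    """Rank values from 0 (lowest) to n-1 (highest); ties are not handled."""
    return np.argsort(np.argsort(np.asarray(x))).astype(float)

def spearman(a, b):
    """Spearman rank-order correlation: the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def quartiles(x):
    """Assign each teacher to a quartile (0 = bottom, 3 = top) of the ranking."""
    return (4 * ranks(x) / len(x)).astype(int)

def transition_matrix(a, b):
    """4x4 counts of teachers by (quartile under model A, quartile under model B);
    off-diagonal entries are teachers whose quartile depends on the model."""
    m = np.zeros((4, 4), dtype=int)
    for qa, qb in zip(quartiles(a), quartiles(b)):
        m[qa, qb] += 1
    return m

# Hypothetical value-added estimates for 8 teachers under two models.
model_a = [2.1, -0.5, 0.3, 1.7, -1.2, 0.9, -0.1, 1.0]
model_b = [1.8, -0.7, 0.5, 1.9, -1.0, 0.4, 0.2, 1.1]
```

For these invented estimates the rankings correlate highly, yet two of the eight teachers still change quartile between models, which is exactly the kind of disagreement the quartile analyses are designed to surface.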
CHAPTER II LITERATURE REVIEW

This chapter discusses value-added modeling within the broader context of student growth in achievement, beginning with the distinction between status and growth and their use as accountability measures. This introduction is followed by the definition of a growth model and an explanation of the general types of growth models, as categorized by different researchers. The key distinction between growth models and VAMs is given, followed by a discussion of applications and considerations for several models. Ongoing concerns about bias, error, and stability in the estimates generated by VAMs are described next. Finally, considerations for those involved in the implementation of teacher evaluation systems that incorporate student achievement data are addressed.

Status versus Growth

As accountability systems in education have evolved over time due to changes in the guidance provided by government agencies, there has been a concomitant movement away from reliance on status measures and toward the adoption of growth measures (Briggs & Betebenner, 2009). The difference between a status measure and a growth measure is a distinction between single and multiple snapshots of student achievement. Castellano and Ho (2013a) define status as the academic performance of a student or group (a collection of students) at a single point in time, and they define growth as the academic performance of a student or group over two or more time points. Status measures, such as yearly average performance, were felt to be insufficient for the purpose of accountability, and student change over time was considered a better measure. With growth measures, each student's progress could be compared against that student's own achievement in the previous year rather than against a cohort average (Callender, 2004).
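A small numerical illustration (invented scores) shows why the two kinds of snapshots can tell different stories: a student can sit below the cohort average on a status measure while showing above-average growth:

```python
# Cohort means and one student's scores on the same scale (invented values).
cohort_prior, cohort_current = 200.0, 210.0
student_prior, student_current = 180.0, 196.0

status_gap = student_current - cohort_current     # -14: below average on status
student_growth = student_current - student_prior  # +16 points of growth
cohort_growth = cohort_current - cohort_prior     # +10: the student outgrew the cohort
```

On status alone this student looks weak; on growth the same student outpaces the cohort, which is the core argument for growth-based accountability.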
Growth Models

Castellano and Ho (2013a) define a growth model as a collection of definitions, calculations, or rules that summarizes student performance over two or more time points and supports interpretations about students, their classrooms, their educators, or their schools. The authors also classify growth models according to several criteria. One such classification is made according to the primary interpretations growth models support, which include growth description, growth prediction, and value-added. Another useful classification system is based on the statistical foundations underlying the growth model, in which three categories are proposed: gain-based models, conditional status models, and multivariate models.

The first of these statistical foundations supports models that use a gain score to quantify growth. A gain score is simply the difference between a test score at one point in time and a test score at another point in time. One essential feature of a test used in the context of a gain-based model is the existence of a vertical scale, which affords a developmental basis for interpretations of growth over successive grade levels. With test scores for all grade levels placed on the same scale, it is possible to compare a student's fall test score from the third grade level to that from the fourth grade level and interpret the difference as the growth the student made over the year in the subject being tested (Castellano & Ho, 2013a).

The second statistical foundation underlies growth models that allow one to interpret a student's current status in light of what that status is expected to be, based on the past scores of that student and others. These are called conditional status models because they refer to the current status conditional on the past status, meaning that they take past test scores into account.
This foundation differs from that of the gain-based models, in which growth is assessed as the difference between current status and past status at two points in time; here, current status is instead compared to an expected status derived from past performance and potentially other information. Castellano and Ho (2013a) give as examples of conditional status models the residual gain model, in which conditional status is defined by the difference between the current score and the score expected given past scores, and the student growth percentile model, in which the expectation is expressed through the percentile rank of the current score in the distribution of scores of students who had the same score at an earlier time.

The third statistical foundation described by Castellano and Ho (2013a) is the basis for multivariate models that are used primarily to estimate school and teacher effects in value-added applications, as it is not the ideal foundation for the purposes of growth description or prediction. Such models make use of large amounts of data and can be very complex. Perhaps the most widely implemented model of this type is the Educational Value-Added Assessment System, known as SAS EVAAS (Sanders & Horn, 1994); this model requires specialized proprietary software from the SAS Institute (SAS Institute, 2012).

The classification of growth models by their statistical foundations offered by Castellano and Ho (2013a) is not intended to be taken as the only correct interpretation; there are other systems for classifying growth models on this basis. For instance, Briggs and Betebenner (2009) assert that all statistical models for test score growth are essentially models of conditional achievement. They note that models can be distinguished from one another based on whether they model student achievement conditional on time or conditional on prior achievement. Models that conceptualize achievement conditional on time are referred to as absolute growth models, and those that conceptualize achievement conditional on prior achievement are referred to as relative growth models.
In their scheme, a gain score model is an absolute growth model that is constrained to use scores from only two longitudinal time points. They too note the requirement for this model that scores be placed on a vertical scale in order to make meaningful comparisons in an absolute sense (Briggs & Betebenner, 2009). These authors assert that the quantity of interest in a relative growth model is the residual, the difference between a student's observed achievement and the achievement that would be predicted given the student's prior achievement. Use of residuals provides a normative interpretation of growth: the residual shows the amount of growth above or below the statistical expectation. Models as different in complexity as simple linear regression models, such as the residual gain model, and multivariate models, such as SAS EVAAS, are relative growth models by this definition. The common foundation underpinning these models is the principle of relative growth, defined as the difference between observed and expected achievement (Briggs & Betebenner, 2009).

Growth Models versus Value-Added Models

Briggs and Betebenner (2009) state that the leap from a growth model to what can be called a value-added model is a short one. They also assert that all growth models can be turned into VAMs through three steps. First, one must define what constitutes expected achievement for a student. Second, one must calculate a deviation from the expected achievement that contrasts what has been observed with what would be expected for the student. Third, one must make the inference that this deviation from expectation is an expression of the value added to student achievement by the teacher. Making a similar argument, Castellano and Ho (2013a) state that they consider value-added to be an inference, not a model. Others take the view that growth models and VAMs are distinct because growth models do not generally control for student background or school factors (Baker et al., 2010).
They argue that one cannot attribute student growth in achievement to teachers without controlling for the effects of these factors. Yet Castellano and Ho (2013a) point out that without a rigorous experimental design in which, among other requirements, students are assigned randomly to classrooms, no model can support value-added inferences on its own. The reality is that in practice, as opposed to in research, most statistical models that have been used to support value-added inferences have tended not to include such predictor variables as race or socioeconomic status measures (National Research Council & National Academy of Education, 2010).

Four Widely Used Models

What follows is a brief description of four models frequently used to characterize student growth for accountability purposes, including teacher evaluation. These are the gain score model, the residual gain/covariate adjustment model, the student growth percentile model, and the SAS EVAAS model.

Gain Score Model

As noted earlier, a gain score is simply the difference between a test score at one point in time and a test score at another point in time. In the context of accountability, the two time points of interest occur at two grade levels, so the scores need to be placed on a common scale that is in turn representative of increasing competence in the domain being tested. The gain score model is an absolute growth model that describes a student's growth relative to his or her own previous score. As the following example (Castellano & Ho, 2013a) shows, the gain score is the difference between the test score at the current time point and the test score at the previous time point. This calculation is depicted graphically in Figure 2.1, where a student's scores in third and fourth grade on a hypothetical vertically-scaled test are shown. The student's scores are marked with black dots, and the gain score is shown by the vertical difference between them. In this case the third grade score, which is 350, is subtracted from the fourth grade score, which is 370, to yield a gain score of +20.
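The gain-score arithmetic is simple enough to sketch directly. The following Python fragment uses invented scores on a hypothetical vertical scale (none of these values come from the study) to compute individual gains, a classroom average, and the deviation from a hypothetical district average gain:

```python
# Gain-score model sketch on hypothetical data (values are illustrative,
# not from the study). Scores lie on a common vertical scale.
prior = [350, 340, 360, 330]     # hypothetical grade 3 scores for one classroom
current = [370, 355, 372, 349]   # the same students' grade 4 scores

# Individual gain score: current score minus prior score.
gains = [c - p for p, c in zip(prior, current)]

# Classroom-level summary: the average gain.
classroom_mean_gain = sum(gains) / len(gains)

# One simple value-added reading: the classroom's deviation from the
# (hypothetical) district average gain.
district_mean_gain = 14.0
value_added = classroom_mean_gain - district_mean_gain

print(gains)                # [20, 15, 12, 19]
print(classroom_mean_gain)  # 16.5
print(value_added)          # 2.5
```

A positive deviation would be read as the classroom gaining more, on average, than the district as a whole; the vertical-scale caveats discussed below apply to any such reading.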
Gain scores can be aggregated to the group level by averaging a set of students' gain scores in order to characterize the average change in performance for the group. Most often the average of students' individual gain scores serves as a group-level summary statistic for a subset of students, such as those in a particular classroom, school, or district. When the average gain score is positive, one can conclude that the students as a group made positive gains, whereas when the average gain score is negative, one can conclude that the group of students declined overall in their performance. Gain score models can be used for making value-added determinations of teacher effectiveness by considering the value added to be the deviation from the average gain in the district. However, some have expressed concern that gain-based models are not the best choice for making value-added inferences, due to the dependence of school effects upon the vertical scaling properties of tests (Briggs & Weeks, 2009). Since vertical scales are developed to enable student growth in achievement to be described, and not necessarily to support causal inferences about that growth, Briggs and Weeks (2009) argue that some properties of the vertical scale may be poorly suited for the purpose of accountability. For instance, some vertical scales reflect that higher-scoring students make greater gains than those who score lower (Castellano & Ho, 2013a). Such a vertical scale may correctly describe the observed pattern of growth with respect to initial status, but it does not make for the best accountability tool where growth expectations for all students are required to be equal. On the other hand, note Castellano and Ho (2013a), these differential, scale-based expectations for lower-scoring students may be precisely what the accountability model should reflect.

Residual Gain/Covariate Adjustment Model

Linear regression is a statistical method that allows the prediction of an outcome variable from one or more predictor variables. The residual gain model uses linear regression to predict students' expected scores from their prior scores.
The residual gain is then calculated as the observed current score minus the expected score determined by the model. The residual is the quantity that describes the amount by which students scored above or below the expected scores determined by their prior performance.

The following example, offered by Castellano and Ho (2013a), serves as an illustration of the residual gain model. Suppose there is a sample of eight students in fourth grade with test scores for both the third and fourth grades. Figure 2.2(a) shows a scatterplot of the students' third and fourth grade scores. The eight students are represented in the plot by solid black dots, and the black line in the figure is the prediction line for fourth grade scores given third grade scores, which is the output of the linear regression method. The prediction line is the least squares best fit of the average fourth grade score across all the third grade scores; thus the line represents the expected fourth grade score at every possible third grade score. For instance, for a student with a third grade score of 350, the model predicts an expected fourth grade score of 364. Determining the expected current score is only the first step in the residual gain model. Figure 2.2(b) illustrates the calculation of the residual gain score, which is the difference between the observed current score and the expected current score. For a particular student whose score in third grade was 350 and in fourth grade was 375, the expected fourth grade score predicted by the linear regression line is 364. In this case the expected fourth grade score, 364, is subtracted from the observed fourth grade score, 375, yielding a residual gain of +11.

The typical summary statistic for a group of students is the average residual gain for those students in the same classroom, school, or district. The mean residual gain score is expected to be zero across the data set used in the analysis; for any given classroom of the data set, however, the mean residual gain score is not necessarily expected to be zero. The magnitude and sign of the mean residual gain score reveal something about the achievement of the students in the classroom being examined, with respect to expectations for their achievement (Castellano & Ho, 2013a).
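The residual gain steps above can be sketched in a few lines of Python. The scores below are invented for illustration, and a closed-form least squares fit stands in for whatever regression software would be used in practice:

```python
# Residual gain sketch on hypothetical data (illustrative values only).
prior = [340, 340, 340, 350, 350, 350, 350, 350]    # grade 3 scores
current = [330, 345, 360, 350, 360, 370, 375, 380]  # grade 4 scores

n = len(prior)
mean_x = sum(prior) / n
mean_y = sum(current) / n

# Ordinary least squares fit of current score on prior score (closed form).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(prior, current))
         / sum((x - mean_x) ** 2 for x in prior))
intercept = mean_y - slope * mean_x

def expected(x):
    """Expected current-year score given the prior-year score."""
    return intercept + slope * x

# Residual gain: observed current score minus expected current score.
residuals = [y - expected(x) for x, y in zip(prior, current)]

# The mean residual is zero across the full data set by construction...
print(abs(round(sum(residuals), 10)))  # 0.0

# ...but not necessarily within a classroom. A classroom's mean residual
# is the quantity the covariate adjustment model reads as value-added.
classroom = residuals[:4]  # hypothetical: first four students share a teacher
print(round(sum(classroom) / len(classroom), 4))  # -4.25
```

Here the hypothetical classroom's students scored about four points below expectation on average, which a covariate adjustment model would report as a negative teacher effect.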
When the assumption is made that the average residual gain is the value added to the average test scores in the group by a teacher or school, the model is a type of VAM called a covariate adjustment model. Like the residual gain model, the covariate adjustment model generates predicted expectations for outcome variables by using one or more predictor variables. The covariate adjustment model is one of the most commonly used models to support value-added interpretations (Castellano & Ho, 2013a).

Student Growth Percentile Model

The student growth percentile (SGP) model describes current student status by taking into account past performance and thus utilizes a conditional status statistical foundation. Since SGPs give the relative position of a student's current score within the conditional distribution of scores from students with similar past performance, the SGP model, like other relative growth models, provides a normative interpretation of growth (Betebenner, 2009). As shown in the previous section, the result of the covariate adjustment model is a single line representing the best prediction of the outcome variable using a predictor variable. The solid black line shown in Figure 2.3 is the linear regression line in the example of Castellano and Ho (2013a) from the previous section, where the predictor variable is the third grade score and the outcome variable is the fourth grade score. Using a technique called quantile regression, the SGP model fits not just one line, the conditional mean that is the result of linear regression, but rather 99 lines, one for each conditional percentile (1 through 99). Shown in Figure 2.3 by a dashed line is the line for the conditional median (the 50th line), which represents the best prediction for the median of the fourth grade scores given the third grade scores. Points lying along or closest to this line would be assigned SGPs of 50. Points lying above the conditional median line would be assigned SGPs higher than 50, depending on which conditional percentile they are closest to; likewise, points lying below the conditional median line would be assigned SGPs lower than 50.
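The conditional-percentile idea behind SGPs can be illustrated without quantile regression. The simplified sketch below (invented data; not the operational SGP procedure, which fits 99 quantile regression lines) computes the percentile rank of a student's current score within the empirical distribution of scores from students who had the same prior score:

```python
# Simplified SGP sketch: use the empirical conditional distribution --
# the percentile rank of a student's current score among students with
# the same prior score. All data are hypothetical.
from collections import defaultdict

records = [  # (prior score, current score), hypothetical
    (300, 310), (300, 320), (300, 330), (300, 340), (300, 350),
    (340, 350), (340, 360), (340, 370), (340, 380), (340, 390),
]

# Group current scores by prior score.
by_prior = defaultdict(list)
for p, c in records:
    by_prior[p].append(c)

def sgp(prior, current):
    """Percentile rank of `current` among peers with the same `prior` score."""
    peers = by_prior[prior]
    below = sum(c < current for c in peers)
    ties = sum(c == current for c in peers)
    # Mid-rank percentile: fraction strictly below plus half the ties.
    return round(100 * (below + 0.5 * ties) / len(peers))

print(sgp(300, 330))  # 50: the median score among peers who scored 300
print(sgp(340, 390))  # 90: near the top of the peer distribution
```

Operational SGP implementations use quantile regression precisely because real data rarely contain enough students with identical prior scores for this direct empirical approach to work.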
Median SGPs are the most commonly used aggregate SGP metric; the median was originally suggested because SGPs are percentile ranks, which lie on a scale not recommended for averaging (Betebenner, 2009). However, it has been shown recently that averages of percentile ranks can support more stable aggregate statistics for SGPs (Castellano & Ho, 2013b). Castellano (2011) showed that the mean function may in fact be preferable to the median function when aggregating SGPs, as mean SGPs were found to classify and rank groups more similarly to value-added effects than were median SGPs. SGPs support descriptive interpretations of growth of student groups when aggregated at the classroom, school, or district level. The aggregates summarize how the SGPs are distributed with either an average value or a typical value from the group. According to Betebenner (2009), SGPs are not intended to be used to support value-added interpretations, although it is reported that SGPs derived from quantile regression are strongly correlated with value-added estimates from the SAS EVAAS model (Briggs & Betebenner, 2009).

Educational Value-Added Assessment System

The SAS EVAAS model is an example of a multivariate model primarily designed to support value-added inferences for schools and teachers (Sanders & Horn, 1994). The model considers all available student scores for up to five years in order to create statistical expectations for performance by tracking students moving through their classrooms and schools over time. Greater or lesser than expected performance can be attributed to the students' teachers and schools, with a causal determination of how much each teacher or school contributes to average student performance (Castellano & Ho, 2013a). In this model, the effect of teachers on student performance is assumed to persist undiminished into the future. That is, the degree to which student performance in third grade is attributable to the third grade teacher persists into fourth grade, fifth grade, and beyond. Because of this feature, the SAS EVAAS model is termed a layered model, as successive teacher effects are layered onto students over time (Braun, 2005). Performance expectations are set for students in a particular classroom by considering all of these students' current test scores as well as their test scores from before they enter the classroom and after they leave it; the model also includes the average scores for the district and individual test scores in all other subjects, in addition to effects from other teachers over time. The SAS EVAAS model is complex, incorporates a large amount of information, and requires highly specialized proprietary software to run (SAS Institute, 2012).

Research on Comparison of Models

No VAM used in an educational setting is generally agreed upon as the best one for accountability decisions, and all VAMs have both favorable and unfavorable features depending upon the context in which they are applied. In any comparison between VAMs applied to either simulated or real data sets, there is no way to definitively assess which model produces the correct (or closer to correct) teacher effects, which are assumed to reflect teacher effectiveness in the classroom. In a study that compared a simple fixed effects model (SFEM) that was parameterized as a gain score model, a layered mixed effects model (LMEM) that has similarities to the SAS EVAAS model, and a hierarchical linear mixed model (HLMM), the researchers suggested that policymakers, school districts, and stakeholders would likely prefer the SFEM because of its transparency (Tekwe et al., 2004). Three cohorts of elementary school students with test score data in reading and mathematics were used to calculate school effects under these models. The researchers found high correlations between rankings from all these models, ranging from .91 to 1.00 in reading and from .96 to 1.00 in mathematics.
Since they believed that the SFEM was the more desirable model because it was more easily understandable, the authors concluded there was no benefit to using the other models in this context.
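Model-comparison studies of this kind typically quantify agreement between two models' teacher rankings with a rank correlation such as Spearman's. The sketch below uses invented teacher effect estimates (not values from any cited study) and computes the statistic from scratch so the fragment is self-contained:

```python
# Comparing two models' teacher rankings with a Spearman rank
# correlation, the kind of agreement statistic behind reported
# rank-correlation results. Effect estimates are hypothetical.
def ranks(values):
    """Rank values from 1 (smallest), averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra)
           * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

# Hypothetical value-added estimates for six teachers under two models.
model_a = [0.21, -0.10, 0.05, 0.33, -0.25, 0.02]
model_b = [0.18, -0.12, 0.09, 0.30, -0.20, 0.11]
print(round(spearman(model_a, model_b), 3))  # 0.943
```

A value near 1.0 indicates that the two models order the teachers almost identically, which is the pattern the comparison studies above report even across models of very different complexity.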
More informationMaster Plan Evaluation Report for English Learner Programs
Master Plan Evaluation Report (2002-03) for English Learner Programs Page i Los Angeles Unified School District Master Plan Evaluation Report for English Learner Programs 2002-03 Prepared by Jesús José
More informationACT Research Explains New ACT Test Writing Scores and Their Relationship to Other Test Scores
ACT Research Explains New ACT Test Writing Scores and Their Relationship to Other Test Scores Wayne J. Camara, Dongmei Li, Deborah J. Harris, Benjamin Andrews, Qing Yi, and Yong He ACT Research Explains
More informationSecondly, this study was peer reviewed, as I have mentioned, by other top experts in the testing and measurement community before it was released.
HOME SCHOOLING WORKS Pass it on! Online Press Conference March 23, 1999, 12:00pm EST A transcript of the opening remarks by Michael Farris, Esq. & Lawrence M. Rudner, Ph.D. Michael Farris: Good morning.
More informationIMPLEMENTATION NOTE. Validating Risk Rating Systems at IRB Institutions
IMPLEMENTATION NOTE Subject: Category: Capital No: A-1 Date: January 2006 I. Introduction The term rating system comprises all of the methods, processes, controls, data collection and IT systems that support
More informationConstructing a TpB Questionnaire: Conceptual and Methodological Considerations
Constructing a TpB Questionnaire: Conceptual and Methodological Considerations September, 2002 (Revised January, 2006) Icek Ajzen Brief Description of the Theory of Planned Behavior According to the theory
More informationTeacher Performance Evaluation System
Chandler Unified School District Teacher Performance Evaluation System Revised 2015-16 Purpose The purpose of this guide is to outline Chandler Unified School District s teacher evaluation process. The
More informationEconomic inequality and educational attainment across a generation
Economic inequality and educational attainment across a generation Mary Campbell, Robert Haveman, Gary Sandefur, and Barbara Wolfe Mary Campbell is an assistant professor of sociology at the University
More informationCore Goal: Teacher and Leader Effectiveness
Teacher and Leader Effectiveness Board of Education Update January 2015 1 Assure that Tulsa Public Schools has an effective teacher in every classroom, an effective principal in every building and an effective
More informationTHE SELECTION OF RETURNS FOR AUDIT BY THE IRS. John P. Hiniker, Internal Revenue Service
THE SELECTION OF RETURNS FOR AUDIT BY THE IRS John P. Hiniker, Internal Revenue Service BACKGROUND The Internal Revenue Service, hereafter referred to as the IRS, is responsible for administering the Internal
More informationLocal outlier detection in data forensics: data mining approach to flag unusual schools
Local outlier detection in data forensics: data mining approach to flag unusual schools Mayuko Simon Data Recognition Corporation Paper presented at the 2012 Conference on Statistical Detection of Potential
More informationThe Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces
The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces Or: How I Learned to Stop Worrying and Love the Ball Comment [DP1]: Titles, headings, and figure/table captions
More informationDescriptive Statistics and Measurement Scales
Descriptive Statistics 1 Descriptive Statistics and Measurement Scales Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample
More informationAnalysis of academy school performance in GCSEs 2014
Analysis of academy school performance in GCSEs 2014 Final report Report Analysis of academy school performance in GCSEs 2013 1 Analysis of Academy School Performance in GCSEs 2014 Jack Worth Published
More informationAcademic Achievement of English Language Learners in Post Proposition 203 Arizona
Academic Achievement of English Language Learners in Post Proposition 203 Arizona by Wayne E. Wright Assistant Professor University of Texas, San Antonio Chang Pu Doctoral Student University of Texas,
More informationWORKING PAPEr 22. By Elias Walsh and Eric Isenberg. How Does a Value-Added Model Compare to the Colorado Growth Model?
WORKING PAPEr 22 By Elias Walsh and Eric Isenberg How Does a Value-Added Model Compare to the Colorado Growth Model? October 2013 Abstract We compare teacher evaluation scores from a typical value-added
More informationCanonical Correlation Analysis
Canonical Correlation Analysis LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the similarities and differences between multiple regression, factor analysis,
More informationclassroom Tool Part 3 of a 5 Part Series: How to Select The Right
How to Select The Right classroom Observation Tool This booklet outlines key questions that can guide observational tool selection. It is intended to provide guiding questions that will help users organize
More informationPartial Estimates of Reliability: Parallel Form Reliability in the Key Stage 2 Science Tests
Partial Estimates of Reliability: Parallel Form Reliability in the Key Stage 2 Science Tests Final Report Sarah Maughan Ben Styles Yin Lin Catherine Kirkup September 29 Partial Estimates of Reliability:
More informationChapter 5. Summary, Conclusions, and Recommendations. The overriding purpose of this study was to determine the relative
149 Chapter 5 Summary, Conclusions, and Recommendations Summary The overriding purpose of this study was to determine the relative importance of construction as a curriculum organizer when viewed from
More informationThe MetLife Survey of
The MetLife Survey of Challenges for School Leadership Challenges for School Leadership A Survey of Teachers and Principals Conducted for: MetLife, Inc. Survey Field Dates: Teachers: October 5 November
More informationExponential Growth and Modeling
Exponential Growth and Modeling Is it Really a Small World After All? I. ASSESSSMENT TASK OVERVIEW & PURPOSE: Students will apply their knowledge of functions and regressions to compare the U.S. population
More informationSouth Carolina College- and Career-Ready (SCCCR) Probability and Statistics
South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)
More informationMapping State Proficiency Standards Onto the NAEP Scales:
Mapping State Proficiency Standards Onto the NAEP Scales: Variation and Change in State Standards for Reading and Mathematics, 2005 2009 NCES 2011-458 U.S. DEPARTMENT OF EDUCATION Contents 1 Executive
More informationTechnical Review Coversheet
Status: Submitted Last Updated: 8/6/1 4:17 PM Technical Review Coversheet Applicant: Seattle Public Schools -- Strategic Planning and Alliances, (S385A1135) Reader #1: ********** Questions Evaluation Criteria
More informationAppendix B Data Quality Dimensions
Appendix B Data Quality Dimensions Purpose Dimensions of data quality are fundamental to understanding how to improve data. This appendix summarizes, in chronological order of publication, three foundational
More informationThe Virginia Reading Assessment: A Case Study in Review
The Virginia Reading Assessment: A Case Study in Review Thomas A. Elliott When you attend a conference organized around the theme of alignment, you begin to realize how complex this seemingly simple concept
More informationA Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic
A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic Report prepared for Brandon Slama Department of Health Management and Informatics University of Missouri, Columbia
More informationValidity, Fairness, and Testing
Validity, Fairness, and Testing Michael Kane Educational Testing Service Conference on Conversations on Validity Around the World Teachers College, New York March 2012 Unpublished Work Copyright 2010 by
More informationII. DISTRIBUTIONS distribution normal distribution. standard scores
Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,
More informationChapter 2 - Why RTI Plays An Important. Important Role in the Determination of Specific Learning Disabilities (SLD) under IDEA 2004
Chapter 2 - Why RTI Plays An Important Role in the Determination of Specific Learning Disabilities (SLD) under IDEA 2004 How Does IDEA 2004 Define a Specific Learning Disability? IDEA 2004 continues to
More informationValue-Added Measures of Educator Performance: Clearing Away the Smoke and Mirrors
Value-Added Measures of Educator Performance: Clearing Away the Smoke and Mirrors (Book forthcoming, Harvard Educ. Press, February, 2011) Douglas N. Harris Associate Professor of Educational Policy and
More information6.4 Normal Distribution
Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under
More informationCompetitive Pay Policy
www.salary.com/hr Copyright 2002 Salary.com, Inc. Competitive Pay Policy Lena M. Bottos and Christopher J. Fusco, SPHR Salary.com, Inc. Abstract A competitive pay policy articulates an organization s strategy
More informationUsing the Leadership Pipeline transition focused concept as the vehicle in integrating your leadership development approach provides:
Building your Leadership Pipeline Leadership transition focused development - White Paper The Leadership Pipeline framework Business case reflections: 1. Integrated leadership development 2. Leadership
More informationMEMO TO: FROM: RE: Background
MEMO TO: FROM: RE: Amy McIntosh, Principal Deputy Assistant Secretary, delegated the authority of the Assistant Secretary, Office of Planning, Evaluation and Policy Development Dr. Erika Hunt and Ms. Alicia
More informationMissing data in randomized controlled trials (RCTs) can
EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationAssociation Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
More informationA STATISTICS COURSE FOR ELEMENTARY AND MIDDLE SCHOOL TEACHERS. Gary Kader and Mike Perry Appalachian State University USA
A STATISTICS COURSE FOR ELEMENTARY AND MIDDLE SCHOOL TEACHERS Gary Kader and Mike Perry Appalachian State University USA This paper will describe a content-pedagogy course designed to prepare elementary
More informationMeta-Analytic Synthesis of Studies Conducted at Marzano Research Laboratory on Instructional Strategies
Meta-Analytic Synthesis of Studies Conducted at Marzano Research Laboratory on Instructional Strategies By Mark W. Haystead & Dr. Robert J. Marzano Marzano Research Laboratory Englewood, CO August, 2009
More informationRaw Score to Scaled Score Conversions
Jon S Twing, PhD Vice President, Psychometric Services NCS Pearson - Iowa City Slide 1 of 22 Personal Background Doctorate in Educational Measurement and Statistics, University of Iowa Responsible for
More informationALTERNATE ACHIEVEMENT STANDARDS FOR STUDENTS WITH THE MOST SIGNIFICANT COGNITIVE DISABILITIES. Non-Regulatory Guidance
ALTERNATE ACHIEVEMENT STANDARDS FOR STUDENTS WITH THE MOST SIGNIFICANT COGNITIVE DISABILITIES Non-Regulatory Guidance August 2005 Alternate Achievement Standards for Students with the Most Significant
More informationBasic Concepts in Research and Data Analysis
Basic Concepts in Research and Data Analysis Introduction: A Common Language for Researchers...2 Steps to Follow When Conducting Research...3 The Research Question... 3 The Hypothesis... 4 Defining the
More informationSimple linear regression
Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between
More informationInterpreting and Using SAT Scores
Interpreting and Using SAT Scores Evaluating Student Performance Use the tables in this section to compare a student s performance on SAT Program tests with the performance of groups of students. These
More informationProgram Rating Sheet - Athens State University Athens, Alabama
Program Rating Sheet - Athens State University Athens, Alabama Undergraduate Secondary Teacher Prep Program: Bachelor of Science in Secondary Education with Certification, Social Science 2013 Program Rating:
More informationProblem of the Month Through the Grapevine
The Problems of the Month (POM) are used in a variety of ways to promote problem solving and to foster the first standard of mathematical practice from the Common Core State Standards: Make sense of problems
More informationSchool Performance Framework: Technical Guide
School Performance Framework: Technical Guide Version 1.6 August 2010 This technical guide provides information about the following topics as they related to interpreting the school performance framework
More informationAn introduction to Value-at-Risk Learning Curve September 2003
An introduction to Value-at-Risk Learning Curve September 2003 Value-at-Risk The introduction of Value-at-Risk (VaR) as an accepted methodology for quantifying market risk is part of the evolution of risk
More information