Reliability and Validity of Inferences About Teachers Based on Student Test Scores


By Edward H. Haertel
William H. Angoff

William H. Angoff was a distinguished research scientist at ETS for more than 40 years. During that time, he made many major contributions to educational measurement and authored some of the classic publications on psychometrics, including the definitive text "Scales, Norms, and Equivalent Scores," which appeared in Robert L. Thorndike's Educational Measurement. Dr. Angoff was noted not only for his commitment to the highest technical standards but also for his rare ability to make complex issues widely accessible.

The Memorial Lecture Series established in his name in 1994 honors Dr. Angoff's legacy by encouraging and supporting the discussion of public interest issues related to educational measurement. These lectures are jointly sponsored by ETS and an endowment fund that was established in Dr. Angoff's memory.

The William H. Angoff Lecture Series reports are published by the Center for Research on Human Capital and Education, ETS Research and Development.

Copyright 2013 by Educational Testing Service. All rights reserved. ETS, the ETS logo and LISTENING. LEARNING. LEADING. are registered trademarks of Educational Testing Service (ETS).
Reliability and Validity of Inferences About Teachers Based on Student Test Scores

The 14th William H. Angoff Memorial Lecture was presented at The National Press Club, Washington, D.C., on March 22, 2013.

Edward H. Haertel
Stanford University

ETS Research & Development
Center for Research on Human Capital and Education
Princeton, NJ
Preface

The 14th William H. Angoff Memorial Lecture was presented by Dr. Edward H. Haertel, Jacks Family Professor of Education, Emeritus, Stanford University. In his lecture, Dr. Haertel examines the use of value-added models (VAMs) in measuring teacher effectiveness. VAMs, complex statistical models for calculating teacher value-added estimates from patterns of student test scores over time, have been receiving increasing attention as a method for states to revise or establish teacher evaluation systems to take into account the effect of individual teachers on student achievement. These models provide scores for teachers, intended to tell how well each did in raising achievement of their students. Using a test validation methodology in assessing VAMs, Haertel examines questions of validity, reliability, prediction power, and potential positive and negative effects of particular uses of teacher value-added scores. His lecture, which includes cautionary notes about using value-added scores in making high-stakes decisions, adds to the public policy discussion of teacher performance evaluation methods.

The William H. Angoff Memorial Lecture Series was established in 1994 to honor the life and work of Bill Angoff, who died in January. For more than 50 years, Dr. Angoff made major contributions to educational and psychological measurement and was deservedly recognized by the major societies in the field. In line with Dr. Angoff's interests, this lecture series is devoted to relatively nontechnical discussions of important public interest issues related to educational measurement.

Ida Lawrence
Senior Vice President
ETS Research & Development
September
Acknowledgments

My thanks go to Robert Mislevy and to Ida Lawrence for the invitation to deliver the 14th William H. Angoff Memorial Lecture, presented March 21, 2013, at ETS in Princeton, New Jersey, and the following day at the National Press Club in Washington, D.C. It has been revised slightly for publication. I am most grateful for thoughtful and constructive comments from several colleagues and reviewers along the way, including Derek Briggs and Jesse Rothstein for their review of an early draft, as well as James Carlson, Daniel McCaffrey, Gary Sykes, and others for their helpful comments on a later version. Their help has been invaluable both in preparing the original talk and in revising it for publication. The views expressed are mine alone, of course, and I am entirely responsible for any remaining errors. Kimberly Ayotte provided outstanding logistical support of all kinds, especially when the lectures had to be cancelled due to Hurricane Sandy and then rescheduled. James Carlson, Richard Coley, and Kim Fryer have provided superb editorial assistance.

Abstract

Policymakers and school administrators have embraced value-added models of teacher effectiveness as tools for educational improvement. Teacher value-added estimates may be viewed as complicated scores of a certain kind. This suggests using a test validation model to examine their reliability and validity. Validation begins with an interpretive argument for inferences or actions based on value-added scores. That argument addresses (a) the meaning of the scores themselves (whether they measure the intended construct); (b) their generalizability (whether the results are stable from year to year or using different student tests, for example); and (c) the relation of value-added scores to broader notions of teacher effectiveness (whether teachers' effectiveness in raising test scores can serve as a proxy for other aspects of teaching quality).
Next, the interpretive argument directs attention to rationales for the expected benefits of particular value-added score uses or interpretations, as well as plausible unintended consequences. This kind of systematic analysis raises serious questions about some popular policy prescriptions based on teacher value-added scores.
Introduction

It seems indisputable that U.S. education is in need of reform. Elected officials, school administrators, and federal policymakers are all frustrated with achievement gaps, vast numbers of schools in need of improvement under the No Child Left Behind Act (NCLB, 2002), and a drumbeat of bad news comparing U.S. test scores to those of other nations. It seems we hear daily about declining college and career readiness, 21st-century skills, and global competitiveness if public education does not improve. At the same time, the belief has spread that research shows just having a top quintile teacher versus a bottom quintile teacher for 5 years in a row could erase the Black-White achievement gap (Ravitch, 2010). It is also widely recognized that our ways of identifying and dismissing poor-performing teachers are inadequate, that teacher credentials alone are poor guides to teaching quality, and that teacher evaluation in most school districts around the country is abysmal.

What could be more reasonable, then, than looking at students' test scores to determine whether or not their teachers are doing a good job? The teacher's job is to teach. Student test scores measure learning. If teachers are teaching, students should learn and scores should go up. If they are teaching well, scores should go up a lot. If test scores are not moving, then the teachers should be held accountable.

There are some messy details, of course, in translating student test scores into teacher effectiveness estimates, but sophisticated statistical models, referred to as value-added models (VAMs), have been created to do just that. Dozens of highly technical articles in leading journals are devoted to these models; data systems linking student test scores over time to individual teachers have improved enormously in recent years. It seems the time has come.
Common sense and scientific research both seem to point to teacher evaluation based on VAMs as a powerful strategy for educational improvement.

In this lecture, I first comment on the importance of teacher effectiveness and the argument concerning top quintile teachers. I next turn to the importance of sound test score scales for value-added modeling, followed by the logic of VAMs and the statistical challenges they must overcome. The major portion of these remarks is devoted to describing an interpretive argument (Kane, 2006) for teacher VAM scores and the associated evidence. The interpretive argument is essentially a chain of reasoning from the construction of teacher VAM scores to the inferences those scores are intended to support. This framework is useful in organizing the many different assumptions required to support inferences about comparisons of individual teachers' effectiveness based on their students' test scores. Finally, I comment briefly on what I believe are more appropriate uses of teacher VAMs and better methods of teacher evaluation.

The Angoff Lectures are intended to be relatively nontechnical discussions. I have tried to explain VAMs in terms that any reader with a little patience should be able to follow, but I am afraid a few technical terms will be unavoidable.

Most of this lecture is concerned with the suitability of VAMs for teacher evaluation. I believe this use of VAMs has been seriously oversold, and some specific applications have been very unwise.[1] I should state at the outset, however, that like most statistical tools, these models are good for some purposes and not for others. In my conclusions, I comment briefly on what I regard as sound versus unsound uses.

[1] See, for example, Winerip (2011).
How Much Does Teacher Effectiveness Matter?

Before getting into the details of VAMs and how they work, let us consider just how much differences in teacher effectiveness really matter for schooling outcomes. Obviously, teachers matter enormously. A classroom full of students with no teacher would probably not learn much, or at least not much of the prescribed curriculum. But the relevant question here is how much does variation among teachers matter for schooling outcomes? The relevant comparison is not between some teacher and no teacher, but rather between a good teacher in some sense and a poor teacher. Teachers appear to be the most critical within-school influence on student learning, but out-of-school factors have been shown to matter even more. One recent study put the influence of out-of-school factors at 60% of the variance in student test scores, and the influence of teachers at around 9% (Goldhaber, Brewer, & Anderson, 1999).[2] Another study, using the Tennessee STAR data, found that teachers accounted for about 13% of the variance in student mathematics test score gains and about 7% of the variance in reading test score gains (Nye, Konstantopoulos, & Hedges, 2004). Some variation is always left unexplained by these models. We might refer to it as random variation or random error, but all that really means is that it is not attributable to any of the factors included in a particular model. So let us just say teacher differences account for about 10% of the variance in student test score gains in a single year.

As shown in Figure 1, whether 10% is a little or a lot depends on how you look at it. Policymakers who seek to improve schooling outcomes have to focus on potentially changeable determinants of those outcomes. Family background, neighborhood environment, peer influences, and differences in students' aptitudes for schooling are seen as largely beyond the reach of educational policy.
Relative to just the smaller set of variables that education policies might directly influence, differences in teacher effectiveness appear quite important. In this respect, 10% may seem large. Some proportion of that 10% will remain outside the reach of policy, but on the other hand, cumulative achievement boosts year after year could add up to a somewhat larger effect. However, if the goal is to dramatically change patterns of U.S. student achievement, then identifying and removing low-performing teachers will not be nearly enough. As my colleague Linda Darling-Hammond has quipped, "You can't fire your way to Finland" ("An Education Exchange," 2011, Teaching Quality Partnerships section, para. 8).

Figure 1. How Much Variance in Student Test Score Gains Is Due to Variation Among Teachers? (Chart titled "Influences on Student Test Scores," with segments for Teacher, Other School Factors, Out of School Factors, and Unexplained Variation.)

There is another sense in which 10% is small. It is small relative to the 90% of the variation due to other factors, only some of which can be explained. Simply put, the statistical models used to estimate teacher VAM scores must separate a weak signal from much noise and possible distortion. Models can filter out much of the noise, but in the end, there is still much remaining.

[2] Goldhaber et al. (1999) reported that roughly 60% of variance in test scores is explained by individual and family background variables, which included a prior year test score.
The Myth of the Top Quintile Teachers

I mentioned the often-repeated story that a string of top quintile teachers versus bottom quintile teachers could erase the Black-White achievement gap in 5 years. Some researchers have suggested 4 years, others 3 years (Ravitch, 2010, pp. 181 ff.). Where do these numbers come from? If test score gains are calculated for every student (just this year's score minus last year's score) and then averaged up to the teacher level, an average test score gain can be obtained for each teacher. (Actual procedures are more complicated, but this will work as a first approximation.) Next, the one fifth of the teachers with the highest average gains can be compared to the one fifth with the lowest gains. The gap between the means for those two groups may be termed the effect of having a top quintile teacher versus a bottom quintile teacher. Suppose that comes out to 5 percentile points. If the Black-White achievement gap is 25 percentile points, then one could claim that if a student got a 5-point boost each year for 5 years in a row, that would be the size of the gap. This sounds good, but there are at least three reasons why such claims may be exaggerated.

Measurement Error

Number one, it is not certain who those top quintile teachers really are. Teacher value-added scores are unreliable. As will be shown, that means the teachers whose students show the biggest gains one year are often not the same as those whose students show big gains the next year. Statistical models can do much better than chance at predicting which teachers' students will show above-average gains, but these predictions will still be wrong much of the time. If one cannot be confident about which teachers are the top performers, then the full benefit implied by the logic of the top quintile/bottom quintile argument cannot be realized.
Measurement error will lead to unrealistically large teacher-effect estimates if the very same student test scores used to calculate teacher value-added are then used again to estimate the size of the teacher effect. This incorrect procedure amounts to a circular argument, whereby highly effective teachers are defined as those producing high student test score gains and those same students' test score gains are then attributed to their having been assigned to highly effective teachers. If a study instead classifies teachers into quintile groups based on their students' performance one year and then examines the performance of different students assigned to those teachers in a later year, the estimated quintile effect should correctly incorporate the effects of measurement error.[3]

Perhaps the first top quintile claim to attract widespread media attention was a study by Sanders and Rivers (1996). Using data from two urban school districts in Tennessee, these authors predicted a 50 percentile point difference between students assigned to top quintile versus bottom quintile teachers for 3 years in a row. Although the description of their statistical model is incomplete, it appears that measurement error may have led to an inflated estimate in this study, and that their finding was probably overstated (Kupermintz, Shepard, & Linn, 2001).

[3] Because teachers' estimated value-added scores always include some measurement error, teachers classified as top quintile or bottom quintile are not truly the most or the least effective. Random error causes some mixing of less effective teachers into the top group and more effective teachers into the bottom group. Thus, teachers classified as top quintile or bottom quintile do not all truly belong in those respective groups, and the effect estimated on the basis of teacher classifications will be smaller than the hypothetical effect attributable to their (unknown) true status.
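The measurement error argument above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not a reconstruction of any study's method: teacher effects and yearly noise are drawn from normal distributions, and the function names, the number of teachers, and the noise level are all hypothetical choices made for illustration. It compares the circular estimate (the same year's scores used both to classify teachers and to measure the gap), the cross-year estimate (classify on one year, measure on another), and the gap that would be seen if true status were known.

```python
import random


def quintile_gap(values, order):
    """Mean of `values` for the top fifth of `order` minus the bottom fifth.

    `order` is a list of teacher indices sorted from lowest to highest.
    """
    k = len(order) // 5
    bottom, top = order[:k], order[-k:]
    return sum(values[i] for i in top) / k - sum(values[i] for i in bottom) / k


def simulate_quintile_gaps(n_teachers=5000, noise_sd=1.0, seed=0):
    """Hypothetical simulation: observed score = true effect + yearly noise."""
    rng = random.Random(seed)
    true_effect = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    year1 = [t + rng.gauss(0.0, noise_sd) for t in true_effect]
    year2 = [t + rng.gauss(0.0, noise_sd) for t in true_effect]

    by_year1 = sorted(range(n_teachers), key=lambda i: year1[i])
    by_truth = sorted(range(n_teachers), key=lambda i: true_effect[i])

    return {
        # Classify and measure with the same noisy scores (circular).
        "circular": quintile_gap(year1, by_year1),
        # Classify on year 1, measure on independent year 2 scores.
        "cross_year": quintile_gap(year2, by_year1),
        # Gap in true effects if true quintile membership were known.
        "true_status": quintile_gap(true_effect, by_truth),
    }


if __name__ == "__main__":
    for name, gap in simulate_quintile_gaps().items():
        print(f"{name}: {gap:.2f}")
```

Under these assumptions the circular estimate should come out largest and the cross-year estimate smallest, with the true-status gap in between, which matches the ordering described in the text and in footnote 3.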
Fade-Out

Problem number two has to do with the idea that one can simply add up gains across years to get a total effect. In fact, the effects of one year's teacher, for good or for ill, fade out in subsequent years. The effects of that wonderful third grade teacher will be much attenuated by the time a student reaches seventh grade. So, the cumulative effect of a string of exceptional teachers will be more than the single year effect, but considerably less than a simple summation would imply.

Implementation Challenge

Finally, problem number three is simply that there is no way to assign all of the top performing teachers to work with minority students or to replace the current teaching force with all top performers. The thought experiment cannot be translated into an actual policy.

Teacher effects do not fade out entirely, of course. In a recent study, Chetty, Friedman, and Rockoff (2011) estimated that about 30% of the teacher effect persists after 3 or 4 years, with little further decline thereafter. They report that this is generally consistent with earlier research, but they are able to provide more accurate and longer term estimates using an exceptionally large longitudinal data set. Their study is also exceptional in detecting elementary school teacher effects lasting even into young adulthood.
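To make the fade-out point concrete, the sketch below compares a simple summation of five annual 5-point boosts with a total that allows each year's boost to decay. The persistence schedule is an assumption made for illustration: only the plateau of roughly 30% after 3 or 4 years comes from the text, and the intermediate values (and the names `PERSISTENCE` and `cumulative_effect`) are hypothetical.

```python
def cumulative_effect(annual_boost, persistence):
    """Total effect at the end of the final year.

    persistence[k] is the assumed share of a teacher's effect that
    remains k years after that teacher taught the student.
    """
    return sum(annual_boost * share for share in persistence)


# Hypothetical schedule: full effect in the year of teaching, decaying to
# the roughly 30% plateau mentioned in the text. The 0.5 and 0.35 values
# are illustrative assumptions, not estimates from any study.
PERSISTENCE = [1.0, 0.5, 0.35, 0.3, 0.3]

naive_total = 5 * len(PERSISTENCE)                 # simple summation: 5 points x 5 years
faded_total = cumulative_effect(5, PERSISTENCE)    # summation with fade-out

if __name__ == "__main__":
    print(f"simple summation: {naive_total} points")
    print(f"with fade-out:    {faded_total:.2f} points")
```

With this assumed schedule the cumulative effect is larger than a single year's boost but well short of the 25 points a simple summation would suggest, as the argument above implies.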
Calculating Test Score Gains

I glossed over another challenge in the top quintile/bottom quintile story when I began with test score gains calculated for each student by simply subtracting last year's score from this year's score. Measuring student achievement is not the same as measuring length or weight. The difference between 2 inches and 4 inches is the same as the difference between 6 inches and 8 inches. That is what is meant by an equal-interval scale. But it is much harder to be sure that the difference between test scores of 20 and 40 is the same as the difference between scores of 60 and 80.

Notice I did not refer to getting 20 items right or 40 items right. Raw scores are pretty much hopeless for these purposes. Test developers use sophisticated statistical models to convert raw scores into scale scores with better statistical properties, but these scale scores are still far from perfect. What does "equal interval" mean in describing test score scales? Does it mean that on average, it takes the same amount of instructional time or teaching skill to boost a student's score from 20 to 40 as it does from 60 to 80? Probably not, actually. The short answer is that the meaning of "equal interval" varies according to the score scale's intended use or interpretation, and even for a specific intended use, whether or not a scale is equal interval cannot generally be determined.

So why does having an equal-interval scale matter? Let us say the score scale is not equal interval. To take just one possible example, let us suppose the units near the top of the scale are actually a little bit smaller than at the bottom of the scale. In that case, as shown in Figure 2, if two teachers' students start out at different score levels, on average, and if the teachers would in fact appear equally effective in raising student test scores on an equal-interval scale, then the measured gains for the students in the higher-performing classroom will appear larger.
A direct comparison of measured score gains for the two teachers will be unfair.[4]

Figure 2. Possible Consequences of a Nonlinear Test Score Scale. (Panel labels: Measured Growth = 6 points; Measured Growth = 7 1/2 points. Annotations: A nonlinear scale means teachers are rewarded or penalized, depending on where their students start out. Especially problematical for teachers of students above or below grade level, or with special needs.)

This is not just a hypothetical argument. Tests aligned to grade-level standards cannot fully register the academic progress of students far above grade level or far below grade level. If the test is too hard for the students, then they may make much progress and still score around the chance level. And if the test is too easy, students may get near-perfect scores on the pretest and not do much better when they are tested again a year later. That translates into bias against those teachers working with the lowest-performing or the highest-performing classes.[5] If tests have an inadequate range of content and difficulty, then bias against some teachers is likely.

[4] The statistical challenge is even greater when the equal-interval scale has to span multiple grade levels. If students' gains are calculated by subtracting prior year test scores from current year scores, then these gains are probably comparisons between scores on two different tests, built to measure different grade-level content standards. Calculating gain scores across years requires something called a vertical scale. If tests are not vertically scaled across grades, VAMs cannot rely on gain scores and must instead incorporate prior year test scores in much the same way as any other predictor. This is satisfactory, but to the extent that prior year scores and current-year scores measure different constructs, accuracy will suffer. Note that the equal-interval scale assumption is important whether or not a vertical scale is assumed.

[5] Teachers of high-performing students typically earn above-average value-added scores, but anecdotal reports suggest that teachers of gifted and talented classrooms may be penalized because tests are too easy to measure their students' progress (Amrein-Beardsley & Collins, 2012).
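The consequence of a scale that is not equal interval can be shown with a toy calculation. This is a sketch under an assumed, purely hypothetical reporting scale (a mild quadratic curve) in which units shrink toward the top, so the same underlying growth registers as more reported points for students who start higher. The function names, the curvature value, and the starting levels are illustrative and do not come from any real test.

```python
def measured_score(true_score, curvature=0.005):
    """Hypothetical reporting scale whose units shrink toward the top.

    One unit of underlying growth yields more reported points at high
    score levels than at low ones. The quadratic form is an assumption.
    """
    return true_score + curvature * true_score ** 2


def measured_gain(start, true_gain):
    """Reported gain for a student who starts at `start` on the
    underlying scale and grows by `true_gain` underlying units."""
    return measured_score(start + true_gain) - measured_score(start)


# Two classrooms with identical underlying growth (10 units) but
# different starting levels.
low_class_gain = measured_gain(start=30, true_gain=10)
high_class_gain = measured_gain(start=70, true_gain=10)

if __name__ == "__main__":
    print(f"lower-scoring classroom, measured gain:  {low_class_gain:.1f}")
    print(f"higher-scoring classroom, measured gain: {high_class_gain:.1f}")
```

Both hypothetical teachers produce the same underlying growth, yet the teacher whose students start higher is credited with the larger measured gain, which is the unfairness described in the text.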
The Logic of Value-Added Models

In examining the logic of VAMs, it is helpful to begin by considering briefly what is wrong with evaluating teachers just by comparing their students' average test scores at the end of the year. Seeing the obvious flaw in that approach should help to clarify the problem the VAM has to solve.

Let us begin with something familiar. Think about a typical testing situation, where each student gets a test consisting of a collection of items. The student answers the items; the answers are scored; the item scores are summed to get a test score; and finally, different students' test scores are compared to see who ranks high and who ranks low.

Next, apply this template to the problem of measuring teacher effectiveness. The comparison is shown in Table 1. This time, think about a testing situation where the examinees are teachers, not students, and where each test item, if you will, is actually a student. The way these student-items are administered to the teacher-examinees is by having the teacher teach the student for a year. The way the student-items are scored is by giving each student an achievement test at the end of the year. The way the student-item scores are summarized is by averaging the students' test scores within each classroom. Then, the teachers are compared to one another based on these averages.

Now, one can see right away that this is not going to work very well because some teachers will get students who are easier to teach or who know more at the start of the year compared to other teachers. If the group of students in a teacher's classroom for a year is like a test for that teacher, then one might say that some teachers are handed much easier tests, and others are handed much harder tests.

So to make the teacher comparisons fairer, one has to adjust for these student differences. This is done by estimating what score each student would have earned, on average, if that student had been taught all year by any other teacher.
Then, by comparing the student's actual end-of-year score to this estimated score average across all possible teachers, one can adjust for those differences in the students assigned to different teachers. The starting premise is that each student spent the year being taught by one particular teacher. The end-of-year scores that would have been observed if that student had instead been taught by some other teacher are each referred to as counterfactuals, the hypothesized outcomes of events that did not actually happen. The student's average score across all these potential, counterfactual teacher assignments is used as the point of comparison for judging the actual score obtained after the student has spent the year with a particular teacher.

Table 1. Test Scores for Students Versus Value-Added Model (VAM) Scores for Teachers

Aspect of testing situation | Typical test | Simplified teacher VAM
Examinees | Students | Teachers
Items | Test questions | Students
Test | Items in a test form | Students in a classroom
Administration | Student answers items | Teacher teaches students
Item scoring | Item responses scored according to key | Student learning scored by giving each student a standardized test
Test score | Sum of item scores | Average of student test scores

Once these counterfactual scores are estimated for each student, one can see whether each student actually performed as well as, better than, or worse than predicted. The estimated average score is subtracted from the observed score, so that a positive difference means better than expected; a negative difference means worse than expected. Then these differences are averaged up to the teacher level.[6]

So how does one estimate how well a given student would have scored after spending the year in some other teacher's classroom? One looks for students similar to that given student and assumes that the average observed score for those other students, obtained after their respective years of instruction with various other teachers, gives a good estimate of the average counterfactual score for the given student. Various kinds of information about students can be used in deciding what "similar" means here.

Now this is not the way VAMs are typically described. In practice, to carry out this process of estimating average counterfactual scores for each student, one makes strong statistical assumptions about the functional form of the relationships among various observable student characteristics and achievement test scores (whether relationships between variables are best characterized as linear, for example, or with some more complicated mathematical function). Then a technique called regression analysis is used to carry out the estimation for all students at once. The process is often described in the convenient shorthand of "controlling for" or "adjusting for" various factors.
That language is perfectly fine, of course, but may make it too easy to ignore the underlying logic of the estimation and the strong assumptions the regression model actually entails.

Some big differences among various VAMs stem from their choices as to what information to use in controlling or adjusting for student differences. Prior year test scores are included, because these are among the most powerful predictors of current-year test scores. Students who scored high last year are likely, on average, to score high again this year. Of course, just looking at last year's test score is not enough. VAMs that reach back further in time, including test scores from 2 years earlier as well as from the previous year, are considerably more accurate. Some models just use prior scores from the same subject area, while others pull in test scores from different subject areas. In addition to test scores, some models use students' absences, suspensions, grade retentions, English learner or special education status, or summer school attendance. Some models may include gender or other demographic variables describing students. Models may include the average scores of other students in the same classroom or the average score for the entire school or district. All of these choices influence the resulting estimates of how well each individual student would have fared, averaging across all possible teacher assignments.

[6] There is an additional technicality in this averaging, which is not of concern here. Because teachers with fewer students are more likely to get extreme value-added estimates just by chance, some models adjust for the amount of information available about each teacher using so-called shrinkage estimators to make extreme scores less likely. This is another modeling decision that influences the outcomes. Different models give different answers.
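The estimation logic described above (predict each student's score from observable characteristics, subtract the prediction from the actual score, then average the differences by teacher) can be sketched in a few lines. This is a deliberately minimal illustration and not any operational VAM: it assumes a single predictor (the prior year score), a linear relationship, made-up data, and a simple `n / (n + k)` shrinkage factor standing in for the shrinkage estimators mentioned in footnote 6. The names `fit_line`, `value_added`, `shrink_k`, and `RECORDS` are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def value_added(records, shrink_k=0.0):
    """Average actual-minus-expected score per teacher.

    records: list of (teacher, prior_score, current_score) tuples.
    shrink_k: assumed shrinkage constant; each teacher's average is
    multiplied by n / (n + shrink_k), pulling small classes toward zero.
    """
    a, b = fit_line([r[1] for r in records], [r[2] for r in records])
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        expected = a + b * prior          # stand-in for the counterfactual average
        residuals[teacher].append(current - expected)
    scores = {}
    for teacher, res in residuals.items():
        n = len(res)
        scores[teacher] = (sum(res) / n) * n / (n + shrink_k)
    return scores


# Made-up data for three hypothetical teachers.
RECORDS = [
    ("A", 40, 48), ("A", 50, 57), ("A", 60, 66), ("A", 70, 77),
    ("B", 45, 47), ("B", 55, 56), ("B", 65, 67),
    ("C", 50, 55),
]

raw_scores = value_added(RECORDS)
shrunk_scores = value_added(RECORDS, shrink_k=2.0)

if __name__ == "__main__":
    for teacher in sorted(raw_scores):
        print(teacher, round(raw_scores[teacher], 2), round(shrunk_scores[teacher], 2))
```

Changing the predictors, the functional form, or the shrinkage constant changes the resulting scores, which is the sense in which different models give different answers.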
Briggs and Domingue (2011) reanalyzed the data used to generate the teacher effectiveness estimates published by the Los Angeles Times in August. Here is what they said about the statistical model used in the analyses published by the LA Times:

"The term value-added is intended to have the same meaning as the term causal effect; that is, to speak of estimating the value-added by a teacher is to speak of estimating the causal effect of that teacher. But once stripped of the Greek symbols and statistical jargon, what we have left is a remarkably simple model that we will refer to as the LAVAM (Los Angeles Value-Added Model). It is a model which, in essence, claims that once we take into account five pieces of information about a student, the student's assignment to any teacher in any grade and year can be regarded as occurring at random. If that claim is accurate, the remaining differences can be said to be the value added or subtracted by that particular teacher." (Briggs & Domingue, 2011, p. 4)

The 5 pieces of information in the LAVAM were test performance in the previous year, gender, English language proficiency, eligibility for Title I services, and whether the student began schooling in the LA Unified School District after kindergarten. In effect, the LAVAM relies on these 5 variables to account for all the systematic differences among the students assigned to different teachers. My point here is not that this particular model is a bad one because it only includes 5 variables, although Briggs and Domingue (2011) did show that teacher rankings changed substantially when an alternative model with some additional control variables was used. (They interpreted their findings as showing that the alternative model had less bias.)
The key point here is to understand how VAMs work: They adjust for some set of student characteristics, and sometimes for certain classroom or school characteristics, and then assume that once those adjustments are made, student assignments to teachers are as good as random.

Stated a little differently, the goal for the VAM is to strip away just those student differences that are outside of the current teacher's control (those things the teacher should not be held accountable for), leaving just those student test score influences the teacher is able to control and therefore should be held accountable for. This is a sensitive business, and different, defensible choices can lead to substantial differences in teachers' value-added rankings.

Earlier I offered an analogy of teacher value-added estimation being like a testing process, in which the teachers are the examinees and the classrooms full of students are like collections of items on different forms of a test. Before leaving that analogy, let me also point out that in any testing situation, common notions of fairness require that all examinees take the test under the same testing conditions. Unlike standardized testing conditions, in the VAM scenario the teacher-examinees may be working under far from equal conditions as they complete their value-added tests by teaching their students for a year. School climate and resources, teacher peer support, and, of course, the additional instructional support and encouragement students receive both out of school and from other school staff all make the test of teaching much easier for teachers in some schools and harder in others.
Statistical Assumptions

VAMs are complicated, but not nearly so complicated as the reality they are intended to represent. Any feasible VAM must rely on simplifying assumptions, and violations of these assumptions may increase bias or reduce precision of the model's value-added estimates. Violations of model assumptions also make it more difficult to quantify just how accurate or inaccurate those estimates really are. Hence, these statistical assumptions matter.

Effects of Social Stratification

Recall that the fundamental challenge is to estimate the average of each student's potential scores across all possible teachers. This is difficult due in part to the socioeconomic stratification in the U.S. school system. Reardon and Raudenbush (2009) pointed out that, given the reality of school segregation on the basis of various demographic characteristics of students, including family socioeconomic background, ethnicity, linguistic background, and prior achievement, in practice, "some students [may] have no access to certain schools" (p. 494). If teachers in some schools have virtually no access to high-achieving students from affluent families, and teachers in other schools have similarly limited access to low-achieving students from poor families, then the statistical model is forced to project well beyond the available data in order to estimate potential scores on a common scale for each student with each teacher. For this reason, VAM estimates are least trustworthy when they are used to compare teachers working in very different schools or with very different student populations.

Peer Effects

Another key assumption holds that a given student's outcome with a given teacher does not depend upon which other students are assigned to that same teacher. This is sometimes stated as "no peer effects."7 One usually thinks about peer effects as arising when students interact with each other. There are peer effects when small groups of students work collaboratively, for example.
Or, peer effects are thought of as arising through peer culture, that is, whether students reinforce or discourage one another's academic efforts. These kinds of effects are important, of course, but for value-added modeling, there are two additional kinds of peer effects that may be equally or more important.

The first of these has to do with how the members of the class collectively influence the teacher's pacing of instruction, the level at which explanations are pitched, the amount of reading assigned, and so forth. If the teacher is "meeting the students where they are," then the average achievement level in the class as a whole is going to influence the amount of content delivered to all of the students over the course of the school year. In the real world of schooling, students are sorted by background and achievement through patterns of residential segregation, and they may also be grouped or tracked within schools. Ignoring this fact is likely to result in penalizing teachers of low-performing students and favoring teachers of high-performing students, just because the teachers of low-performing students cannot go as fast.

7 Technically, no peer effects is an implication of the stable unit treatment value assumption (SUTVA).
Yet another kind of peer effect arises when some students in the classroom directly promote or disrupt the learning of others. Just about every teacher can recall some classes where the chemistry was right; perhaps one or two strong students always seemed to ask just the right question at just the right time to move the classroom discussion along. Most teachers can also recall some classes where things did not go so well. Perhaps one or two students were highly disruptive or repeatedly pulled the classroom discussion off topic, wasting precious minutes before the teacher could get the lesson back on track.8

Simply put, the net result of these peer effects is that VAMs will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in. Some of these peer effects (e.g., disruptive students) may add random noise to VAM estimates. Others (e.g., effect of average achievement level on pacing) may introduce bias.9 Adjusting for individual students' prior test scores and other background characteristics may mitigate but cannot eliminate this problem.

8 It is, of course, the teacher's responsibility to manage disruptive students, but the fact remains that teacher time spent dealing with such classroom disruptions may affect the learning of all students in the classroom.

9 Some, but not all, VAMs incorporate classroom- or school-level measures to help control for these kinds of systematic effects.
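A toy calculation can illustrate why adjusting for individual prior scores mitigates, but does not remove, a pacing effect of this kind. The sketch below is built entirely on assumptions: two teachers with identical (zero) true effects, a hypothetical pacing coefficient `PACING` of 0.5 applied to the class-average prior score, and a one-variable pooled regression standing in for a VAM. None of these values comes from the studies cited here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


# Hypothetical setup: teacher "L" has a lower-achieving class than teacher "H".
# Both teachers are equally effective; only the class composition differs.
PACING = 0.5  # assumed effect of class-average prior score on every student
classes = {"L": [-2.0, -1.0, 0.0], "H": [0.0, 1.0, 2.0]}

records = []
for teacher, priors in classes.items():
    class_mean = sum(priors) / len(priors)
    for prior in priors:
        current = prior + PACING * class_mean  # no teacher effect at all
        records.append((teacher, prior, current))

# Raw average gain per class (no adjustment).
raw_gain = {
    t: sum(cur - pri for tt, pri, cur in records if tt == t) / len(classes[t])
    for t in classes
}

# Adjust for each student's own prior score only, then average the residuals.
a, b = fit_line([r[1] for r in records], [r[2] for r in records])
estimates = {}
for teacher in classes:
    resid = [cur - (a + b * pri) for t, pri, cur in records if t == teacher]
    estimates[teacher] = sum(resid) / len(resid)

print(raw_gain)
print(estimates)
```

In this constructed case the raw gains differ by a full point (-0.5 versus +0.5), and the adjusted estimates still differ (about -0.2 versus +0.2) even though the two teachers are identical by design. The individual-level adjustment absorbs part of the class-level effect but not all of it.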
An Interpretive Argument for Value-Added Model (VAM) Teacher Effectiveness Estimates

I suggested earlier that one might think of teacher value-added effectiveness estimates as a complicated kind of test score. Teachers are the examinees; each student is like a test item. Assigning a classroom full of students to a teacher for a year is like giving the teacher a test composed of 30 or so items. Thinking about the entire series of steps involved in value-added estimation as a single, complicated measurement process, one can consider the validity of VAM scores for any given purpose in much the same way as a testing expert would consider the validity of any other score. An interpretive argument is needed: a logical sequence of propositions that, taken together, make the case for the proposed use or interpretation. Then, once there is an interpretive argument, the strength of the evidence supporting each proposition must be considered.

Perhaps the most authoritative contemporary treatment of test validation is provided by Michael Kane's (2006) chapter, "Validation," in the most recent edition of Educational Measurement. His analysis laid out four broad steps in the interpretive argument (Kane, 2006, p. 34). My application of this framework to teacher VAM score estimation is shown in Table 2.

Table 2
An Interpretive Argument for Teacher Value-Added Model (VAM) Scores

Stage of interpretive argument | Description | Focusing question
1. Scoring (observed score) | Construction of observed VAM score for an individual teacher | Is the score unbiased? (i.e., is systematic error acceptably small?)
2. Generalization (observed score to universe score) | Generalization to scores that might have been obtained with a different group of students or a parallel form of the same test | Is the score reliable? (i.e., is random error acceptably small?)
3. Extrapolation (universe score to target score) | Extrapolation to teacher effectiveness more broadly construed | Do scores correlate with other kinds of indicators of teaching quality? Do teacher rankings depend heavily on the particular test used? Does achievement test content fully capture valued learning outcomes? How do VAM scores relate to valued nontest (noncognitive) outcomes?
4. Implication (target score to interpretation or decision) | Soundness of the intended decision or interpretation | Are intended benefits likely to be realized? Have plausible unintended consequences been considered?

The first step is scoring. Here the scoring proposition holds that teacher VAM scores accurately capture each teacher's effectiveness, with the particular group of students that teacher actually taught, as measured by the student achievement test actually administered. In other words, each teacher's VAM score captures that teacher's degree of success in imparting the knowledge and skills measured by the student achievement test, reasonably undistorted by irrelevant factors. Scoring is the step from the teacher's classroom performance to the teacher's VAM score.

The second step is generalization, which addresses test reliability. One needs to know how stable VAM scores would be across different possible classes a teacher might have taught and also over time. If this year's VAM score gives poor guidance as to a teacher's likely effectiveness next year, then it is not very useful. In the language of test theory, this is the step from the observed score to the universe score, the long-run average across imagined repeated measurements.

The third step is extrapolation, which directs attention to the relation between the student achievement test actually used and other tests that might have been used instead for capturing student learning outcomes. It also covers the broader question of how well students' scores on this test or similar tests can capture the full range of important schooling outcomes. The real target of any measurement is some quality that is broader than test taking per se. In Kane's (2006) terminology, extrapolation is the move from the universe score to that target score.

Finally, the fourth step is implication, which directs attention to rationales for the expected benefits of each particular score use or interpretation, as well as plausible unintended consequences. This is the step from the target score to some decision or verbal description.

Let us next turn to some of the evidence concerning each of these steps. Scoring will address issues of bias, or systematic error. Generalization will address reliability, or random error. Extrapolation will address the relation between teacher VAM scores and other measures of effectiveness.10 Implication, finally, will take up the question of appropriate and inappropriate uses of VAM scores and their likely consequences.

Scoring

Recall that the scoring step holds that a teacher's VAM estimate really does tell how effective that teacher was, this year, with these students, in teaching the content measured by this particular achievement test. This means the scoring must be free of systematic bias, the statistical model must reflect reality, and the data must fit the model.
The word "bias" is used in a statistical sense, although here the commonsense meaning of the term is not too far off. Bias refers to errors that do not average out as more information is collected. If teachers in some kinds of schools, or working with some kinds of students, or teaching in certain grades or subject areas tend to get systematically lower or higher VAM estimates, that kind of error will not average out in the long run. The error will tend to show up again and again for a given teacher, in the same direction, year after year, simply because teachers tend to work with similar students year after year, typically in the same or similar schools.

Let us consider this question of bias. Jesse Rothstein (2010) published an important paper in which he developed and applied a falsification test for each of three different VAM specifications. Rothstein argued that it is logically impossible for current teacher assignments to influence students' test score gains in earlier years. This year's teacher cannot influence last year's achievement. Therefore, if a VAM is run backward in time, using current teacher assignments to predict students' score gains in earlier years, it ought to show that the true variance of prior year teacher effects, discounting random error, is near zero. This is called a falsification test because if the analysis does estimate substantial variance for prior year teacher effects, then those estimates have to be biased. Such a finding strongly suggests that current-year teacher effect estimates may also be biased, although it does not prove the existence of bias.11

10 Another important dimension of extrapolation is related to the assumption that a teacher's effectiveness with one sort of students is predictive of that teacher's effectiveness with different sorts of students. The assumption that a teacher has some effectiveness independent of the kinds of students that teacher is working with is important, but largely unexamined.

Rothstein (2010) tried this out using data from fifth grade classrooms in North Carolina. His sample included more than 60,000 students in more than 3,000 classrooms in 868 schools. He tried several different VAMs and consistently found that fifth grade teacher assignments showed powerful effects on third to fourth grade test score gains. Briggs and Domingue (2011) used Rothstein's test to look at the data on teachers from the LA Unified School District, the same data set Richard Buddin used to estimate the first round of teacher value-added scores published by the Los Angeles Times in August. On the reading test, they found that teachers' estimated effects on their students' gains during a previous school year were about as large as their estimated effects on score gains during the current year. On a mathematics test, the logically impossible prior year effects came out around two thirds as large as for the current year. In one comparison, the estimated effects of fourth grade teachers on third grade reading gains were slightly larger than those teachers' estimated effects on fourth grade reading gains. Similar findings have emerged in other studies.

How can this be? As stated earlier, one reason is the massively nonrandom grouping of students, both within and between schools, as a function of family socioeconomic background and other factors. This clearly has the potential to distort teacher effectiveness estimates coming out of VAMs. Nonrandom assignment might also take the form of assigning struggling readers to reading specialists or English learners to bilingual teachers. Bias is also possible due to differences in the schools where teachers work. Not all schools are equally conducive to student learning.
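The logic of a backward-in-time check can be illustrated with a small simulation. This is only a sketch of the idea, not Rothstein's (2010) procedure, which applies full VAM specifications to real data. It assumes standard-normal prior-year gains, 50 classes of 30 students, and a crude form of tracking in which students are grouped by sorting on those prior gains; the function name `prior_gain_spread` is invented here.

```python
import random
from statistics import mean, pvariance


def prior_gain_spread(sorted_assignment, n_teachers=50, class_size=30, seed=0):
    """Variance across current teachers of their students' PRIOR-year gains.

    The current teacher cannot have caused these gains, so any spread beyond
    sampling noise reflects how students were assigned, not teaching.
    """
    rng = random.Random(seed)
    gains = [rng.gauss(0.0, 1.0) for _ in range(n_teachers * class_size)]
    if sorted_assignment:
        gains.sort()  # tracking: students with similar histories share a class
    classes = [
        gains[i * class_size:(i + 1) * class_size] for i in range(n_teachers)
    ]
    # Apparent "effect" of each current teacher on last year's gains.
    return pvariance([mean(c) for c in classes])


random_spread = prior_gain_spread(sorted_assignment=False)
tracked_spread = prior_gain_spread(sorted_assignment=True)
print(random_spread, tracked_spread)
```

Under random assignment the spread of class averages is small, roughly what sampling noise alone would produce (about 1/30 of the student-level variance with classes of 30). Under the sorted assignment nearly all of the student-level variance shows up between classes, so the current teachers appear to have large "effects" on gains that occurred before they met their students.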
Bias may come about because peer effects are not fully accounted for. Some limited evidence suggests that bias in VAMs may not be a serious problem (e.g., Chetty et al., 2011; Kane, McCaffrey, Miller, & Staiger, 2013). However, like all studies, each of these has some weaknesses and limitations.12 Moreover, the fact that no bias is detected in one VAM application is no guarantee that bias may not exist in some other setting.

11 Goldhaber and Chaplin (2012) analyzed the conditions under which it is possible for one of Rothstein's specifications to yield a non-null finding even if current-year effect estimates are unbiased and called for further investigation. Chetty et al. (2011) implemented a quasi-experimental test for selection on unobservables, based on teacher switching between schools, and also concluded that, although they replicated Rothstein's results, this does not in fact imply that their estimates of long-term teacher effects are biased.

12 The Chetty et al. (2011) study relied on student test data collected under relatively low-stakes conditions, which limits its applicability to VAMs with high stakes for teachers. The MET Project randomization study by Kane et al. (2013) examined random student assignment under rather constrained conditions and also suffered from problems of attrition and noncompliance. These problems limited its power to detect bias due to student assignment.

Another significant concern arises because the student achievement tests often used to date have been those mandated by NCLB (2002), which by law are limited to testing content at grade level. That means that teachers of gifted and talented classes may be unable to earn high value-added scores because their above-grade-level students are topping out on the tests and simply cannot demonstrate any further score gains. Likewise, teachers whose students are far below grade level may be penalized because the content they are teaching to meet their students' needs does not show up on the tests used to measure student growth.

Yet another potential source of bias is related to summer learning loss (see Figure 3). Jennifer Sloan McCombs and her colleagues at the RAND Corporation (McCombs et al., 2011) recently reviewed the research on summer learning loss. They concluded that on average, elementary school students lose about 1 month of learning over the summer months, from spring to fall. Losses are somewhat larger for mathematics, somewhat smaller for reading. But more importantly, these losses are not the same for all students. On average, students from higher income families actually post gains in reading achievement over the summer months, while their peers from lower income families post losses. This suggests a potential distortion in comparisons of VAM estimates among teachers whose students come from different economic backgrounds. On average, reading scores from the previous spring will underestimate the initial autumn proficiency of students in more advantaged classrooms and overestimate the initial autumn proficiency of those in less advantaged classrooms. Even if the two groups of students in fact make equal fall-to-spring gains, their measured prior spring-to-spring gains may differ. Some of this difference may be accounted for in VAMs that include adjustments for demographic factors, but once again, it appears likely that value-added estimates may be biased in favor of some teachers and against others.

Figure 3. Summer Learning Loss Is Not the Same for Students From Less Affluent Versus More Affluent Families
Measured spring-to-spring test score gain = spring-to-fall (summer) loss or gain + fall-to-spring (school year) gain.
Low income families: summer learning loss; spring-to-spring gain understates school year gain.
High income families: summer learning gain in reading; spring-to-spring gain overstates school year gain.

These concerns must be balanced against compelling empirical evidence that teacher VAM scores are capturing some important elements of teaching quality. In particular, Chetty et al. (2011) recently reported that teachers' VAM scores predicted their students' future college attendance, earnings, socioeconomic status, and even teenage pregnancy rates.13 Their study included creative statistical tests for bias due to omitted variables, and they found no bias. Similarly, Goldhaber and Hansen (2010) have reported modest but statistically significant effects of teacher VAM estimates on student test scores several years later. Teacher VAM scores are certainly not just random noise. These models appear to capture important differences in teachers' effects on student learning outcomes. But even the best models are not pure measures of teacher effectiveness. VAM scores do predict important student learning outcomes, but my reading of the evidence strongly suggests that these scores nonetheless measure not only how well teachers teach, but also whom and where they teach.

13 The study by Chetty et al. (2011) is very carefully done, but relied on data collected in a context in which no particularly high stakes were attached to student test scores. Even in that context, the authors set aside the top 2% of teacher VAM scores because these teachers' "impacts on test scores appear suspiciously consistent with testing irregularities indicative of cheating" (Chetty et al., 2011, p. 23). When these teachers were included in the analysis, estimated long-term teacher effects were reduced by roughly 20% to 40%.
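The summer learning loss decomposition shown in Figure 3 can be worked through with a few lines of arithmetic. The numbers below are purely hypothetical (a 10-point school-year gain for both classrooms, and a 1-point summer loss or gain); they are not taken from McCombs et al. (2011) and serve only to show the direction of the distortion.

```python
def spring_to_spring_gain(summer_change, school_year_gain):
    """Measured gain = summer loss or gain + fall-to-spring gain (Figure 3)."""
    return summer_change + school_year_gain


# Assumption: both classrooms make the same fall-to-spring gain.
school_year_gain = 10.0

# Assumption: a 1-point summer loss for the lower income classroom and a
# 1-point summer gain for the higher income classroom.
low_income = spring_to_spring_gain(-1.0, school_year_gain)
high_income = spring_to_spring_gain(+1.0, school_year_gain)

print(low_income, high_income)  # 9.0 11.0
```

With identical teaching during the school year, the spring-to-spring measure credits the teacher of the higher income classroom with 11 points and the teacher of the lower income classroom with 9. The 2-point difference comes entirely from what happened over the summer.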
Generalization

The second link in the chain of propositions needed to support VAM scores is generalization, the step from observed score to universe score. The first proposition, scoring, focused on the question of what value-added scores were measuring, including the question of whether those scores were free from systematic bias. Generalization shifts attention from what to how well and from systematic error to random error. It focuses on the question of how stable or unstable teacher VAM scores turn out to be. This is the familiar issue of score reliability.

One very good way to estimate reliability is just to correlate value-added scores from two points in time, or from two sections of the same class. The correlation itself is the same as a reliability coefficient. Several years ago, Daniel McCaffrey and his coauthors investigated a variety of VAM specifications and data sets and found year-to-year correlations mostly between .2 and .4, with a few lower and a few higher (McCaffrey, Sass, Lockwood, & Mihaly, 2009). More specifically, they looked at value-added scores for teachers in five different counties in Florida. Figure 4 illustrates some of their findings for elementary school teachers. They found that in each county, a minimum of 10% of the teachers in the bottom fifth of the distribution one year were in the top fifth the next year, and conversely.
Typically, only about a third of 1 year's top performers were in the top category again the following year, and likewise, only about a third of 1 year's lowest performers were in the lowest category again the following year. These findings are typical. A few studies have found reliabilities around .5 or a little higher (e.g., Koedel & Betts, 2007), but this still says that only half the variation in these value-added estimates is signal, and the remainder is noise.

Figure 4. Year-to-Year Changes in Teacher Value-Added Rankings Reported by McCaffrey et al. (2009, Table 4, p. 591)
Panels: next-year distribution of one year's bottom quintile and of one year's top quintile of elementary teachers in five Florida counties (Dade, Duval, Hillsborough, Orange, Palm Beach). Vertical axis: percent. Horizontal axis: next-year quintile, from bottom to top.

McCaffrey and his colleagues (2009) pointed out that year-to-year changes in teachers' scores reflected both the vagaries of student sampling and actual changes in teachers' effectiveness from year to year. But if one wants to know how useful one year's score is for predicting the next year's score, that distinction does not matter. McCaffrey et al.'s results imply that unstable or random components together account for more than half the variability in VAM scores, and in some cases as much as 80% or more. Sorting teachers according to single-year value-added scores is sorting mostly on noise.
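The link between a year-to-year correlation and the share of signal in a score can be checked with a short simulation. This is a sketch under simple assumptions, not a reanalysis of McCaffrey et al. (2009): each simulated teacher has a stable component plus independent noise in each year, both normally distributed, with the stable share of variance set to an assumed reliability of 0.3 (inside the .2 to .4 range quoted above). The function names are invented for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(reliability=0.3, n_teachers=5000, seed=1):
    """Two years of scores: stable signal plus fresh noise each year."""
    rng = random.Random(seed)
    signal_sd = reliability ** 0.5
    noise_sd = (1.0 - reliability) ** 0.5
    year1, year2 = [], []
    for _ in range(n_teachers):
        stable = rng.gauss(0.0, signal_sd)
        year1.append(stable + rng.gauss(0.0, noise_sd))
        year2.append(stable + rng.gauss(0.0, noise_sd))
    return year1, year2


def bottom_quintile_persistence(year1, year2):
    """Share of year-1 bottom-fifth teachers who are bottom-fifth in year 2."""
    n = len(year1)
    k = n // 5
    bottom1 = set(sorted(range(n), key=lambda i: year1[i])[:k])
    bottom2 = set(sorted(range(n), key=lambda i: year2[i])[:k])
    return len(bottom1 & bottom2) / k


year1, year2 = simulate()
r = pearson(year1, year2)
stay = bottom_quintile_persistence(year1, year2)
print(round(r, 2), round(stay, 2))
```

In this setup the year-to-year correlation should land near the assumed reliability, and the share of bottom-quintile teachers who remain in the bottom quintile should fall well below one half, in the neighborhood of the "about a third" figure cited above. If ranks carried no information at all, that share would be one fifth.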