Reliability and Validity of Inferences About Teachers Based on Student Test Scores

By Edward H. Haertel
William H. Angoff

William H. Angoff was a distinguished research scientist at ETS for more than 40 years. During that time, he made many major contributions to educational measurement and authored some of the classic publications on psychometrics, including the definitive text Scales, Norms, and Equivalent Scores, which appeared in Robert L. Thorndike's Educational Measurement. Dr. Angoff was noted not only for his commitment to the highest technical standards but also for his rare ability to make complex issues widely accessible.

The Memorial Lecture Series established in his name in 1994 honors Dr. Angoff's legacy by encouraging and supporting the discussion of public interest issues related to educational measurement. These lectures are jointly sponsored by ETS and an endowment fund that was established in Dr. Angoff's memory.

The William H. Angoff Lecture Series reports are published by the Center for Research on Human Capital and Education, ETS Research and Development.

Copyright 2013 by Educational Testing Service. All rights reserved. ETS, the ETS logo and LISTENING. LEARNING. LEADING. are registered trademarks of Educational Testing Service (ETS).
Reliability and Validity of Inferences About Teachers Based on Student Test Scores

The 14th William H. Angoff Memorial Lecture was presented at The National Press Club, Washington, D.C., on March 22, 2013.

Edward H. Haertel
Stanford University

ETS Research & Development
Center for Research on Human Capital and Education
Princeton, NJ
Preface

The 14th William H. Angoff Memorial Lecture was presented by Dr. Edward H. Haertel, Jacks Family Professor of Education, Emeritus, Stanford University. In his lecture, Dr. Haertel examines the use of value-added models (VAMs) in measuring teacher effectiveness. VAMs, complex statistical models for calculating teacher value-added estimates from patterns of student test scores over time, have been receiving increasing attention as a method for states to revise or establish teacher evaluation systems to take into account the effect of individual teachers on student achievement. These models provide scores for teachers, intended to tell how well each did in raising achievement of their students. Using a test validation methodology in assessing VAMs, Haertel examines questions of validity, reliability, prediction power, and potential positive and negative effects of particular uses of teacher value-added scores. His lecture, which includes cautionary notes about using value-added scores in making high-stakes decisions, adds to the public policy discussion of teacher performance evaluation methods.

The William H. Angoff Memorial Lecture Series was established in 1994 to honor the life and work of Bill Angoff, who died in January 1993. For more than 50 years, Dr. Angoff made major contributions to educational and psychological measurement and was deservedly recognized by the major societies in the field. In line with Dr. Angoff's interests, this lecture series is devoted to relatively nontechnical discussions of important public interest issues related to educational measurement.

Ida Lawrence
Senior Vice President
ETS Research & Development
September 2013
Acknowledgments

My thanks go to Robert Mislevy and to Ida Lawrence for the invitation to deliver the 14th William H. Angoff Memorial Lecture, presented March 21, 2013, at ETS in Princeton, New Jersey, and the following day at the National Press Club in Washington, D.C. It has been revised slightly for publication. I am most grateful for thoughtful and constructive comments from several colleagues and reviewers along the way, including Derek Briggs and Jesse Rothstein for their review of an early draft, as well as James Carlson, Daniel McCaffrey, Gary Sykes, and others for their helpful comments on a later version. Their help has been invaluable both in preparing the original talk and in revising it for publication. The views expressed are mine alone, of course, and I am entirely responsible for any remaining errors. Kimberly Ayotte provided outstanding logistical support of all kinds, especially when the lectures had to be cancelled due to Hurricane Sandy and then rescheduled. James Carlson, Richard Coley, and Kim Fryer have provided superb editorial assistance.

Abstract

Policymakers and school administrators have embraced value-added models of teacher effectiveness as tools for educational improvement. Teacher value-added estimates may be viewed as complicated scores of a certain kind. This suggests using a test validation model to examine their reliability and validity. Validation begins with an interpretive argument for inferences or actions based on value-added scores. That argument addresses (a) the meaning of the scores themselves (whether they measure the intended construct); (b) their generalizability (whether the results are stable from year to year or using different student tests, for example); and (c) the relation of value-added scores to broader notions of teacher effectiveness (whether teachers' effectiveness in raising test scores can serve as a proxy for other aspects of teaching quality). Next, the interpretive argument directs attention to rationales for the expected benefits of particular value-added score uses or interpretations, as well as plausible unintended consequences. This kind of systematic analysis raises serious questions about some popular policy prescriptions based on teacher value-added scores.
Introduction

It seems indisputable that U.S. education is in need of reform. Elected officials, school administrators, and federal policymakers are all frustrated with achievement gaps, vast numbers of schools in need of improvement under the No Child Left Behind Act (NCLB, 2002), and a drumbeat of bad news comparing U.S. test scores to those of other nations. It seems we hear daily about declining college and career readiness, 21st-century skills, and global competitiveness if public education does not improve. At the same time, the belief has spread that research shows just having a top quintile teacher versus a bottom quintile teacher for 5 years in a row could erase the Black-White achievement gap (Ravitch, 2010). It is also widely recognized that our ways of identifying and dismissing poor-performing teachers are inadequate, that teacher credentials alone are poor guides to teaching quality, and that teacher evaluation in most school districts around the country is abysmal.

What could be more reasonable, then, than looking at students' test scores to determine whether or not their teachers are doing a good job? The teacher's job is to teach. Student test scores measure learning. If teachers are teaching, students should learn and scores should go up. If they are teaching well, scores should go up a lot. If test scores are not moving, then the teachers should be held accountable. There are some messy details, of course, in translating student test scores into teacher effectiveness estimates, but sophisticated statistical models, referred to as value-added models (VAMs), have been created to do just that. Dozens of highly technical articles in leading journals are devoted to these models; data systems linking student test scores over time to individual teachers have improved enormously in recent years. It seems the time has come.
Common sense and scientific research both seem to point to teacher evaluation based on VAMs as a powerful strategy for educational improvement.

In this lecture, I first comment on the importance of teacher effectiveness and the argument concerning top quintile teachers. I next turn to the importance of sound test score scales for value-added modeling, followed by the logic of VAMs and the statistical challenges they must overcome. The major portion of these remarks is devoted to describing an interpretive argument (Kane, 2006) for teacher VAM scores and the associated evidence. The interpretive argument is essentially a chain of reasoning from the construction of teacher VAM scores to the inferences those scores are intended to support. This framework is useful in organizing the many different assumptions required to support inferences about comparisons of individual teachers' effectiveness based on their students' test scores. Finally, I comment briefly on what I believe are more appropriate uses of teacher VAMs and better methods of teacher evaluation.

The Angoff Lectures are intended to be relatively nontechnical discussions. I have tried to explain VAMs in terms that any reader with a little patience should be able to follow, but I am afraid a few technical terms will be unavoidable.

Most of this lecture is concerned with the suitability of VAMs for teacher evaluation. I believe this use of VAMs has been seriously oversold, and some specific applications have been very unwise.[1] I should state at the outset, however, that like most statistical tools, these models are good for some purposes and not for others. In my conclusions, I comment briefly on what I regard as sound versus unsound uses.

[1] See, for example, Winerip (2011).
How Much Does Teacher Effectiveness Matter?

Before getting into the details of VAMs and how they work, let us consider just how much differences in teacher effectiveness really matter for schooling outcomes. Obviously, teachers matter enormously. A classroom full of students with no teacher would probably not learn much, or at least not much of the prescribed curriculum. But the relevant question here is how much does variation among teachers matter for schooling outcomes? The relevant comparison is not between some teacher and no teacher, but rather between a good teacher in some sense and a poor teacher. Teachers appear to be the most critical within-school influence on student learning, but out-of-school factors have been shown to matter even more. One recent study put the influence of out-of-school factors at 60% of the variance in student test scores, and the influence of teachers at around 9% (Goldhaber, Brewer, & Anderson, 1999).[2] Another study, using the Tennessee STAR data, found that teachers accounted for about 13% of the variance in student mathematics test score gains and about 7% of the variance in reading test score gains (Nye, Konstantopoulos, & Hedges, 2004). Some variation is always left unexplained by these models; we might refer to it as random variation or random error, but all that really means is that it is not attributable to any of the factors included in a particular model. So let us just say teacher differences account for about 10% of the variance in student test score gains in a single year.

As shown in Figure 1, whether 10% is a little or a lot depends on how you look at it. Policymakers who seek to improve schooling outcomes have to focus on potentially changeable determinants of those outcomes. Family background, neighborhood environment, peer influences, and differences in students' aptitudes for schooling are seen as largely beyond the reach of educational policy. Relative to just the smaller set of variables that education policies might directly influence, differences in teacher effectiveness appear quite important. In this respect, 10% may seem large. Some proportion of that 10% will remain outside the reach of policy, but on the other hand, cumulative achievement boosts year after year could add up to a somewhat larger effect. However, if the goal is to dramatically change patterns of U.S. student achievement, then identifying and removing low-performing teachers will not be nearly enough. As my colleague Linda Darling-Hammond has quipped, "You can't fire your way to Finland" ("An Education Exchange," 2011, Teaching Quality Partnerships section, para. 8).

Figure 1. How Much Variance in Student Test Score Gains Is Due to Variation Among Teachers? [Chart: Influences on Student Test Scores, divided into Teacher, Other School Factors, Out of School Factors, and Unexplained Variation.]

There is another sense in which 10% is small. It is small relative to the 90% of the variation due to other factors, only some of which can be explained. Simply put, the statistical models used to estimate teacher VAM scores must separate a weak signal from much noise and possible distortion. Models can filter out much of the noise, but in the end, there is still much remaining.

[2] Goldhaber et al. (1999) reported that roughly 60% of variance in test scores is explained by individual and family background variables, which included a prior year test score.
The Myth of the Top Quintile Teachers

I mentioned the often-repeated story that a string of top quintile teachers versus bottom quintile teachers could erase the Black-White achievement gap in 5 years. Some researchers have suggested 4 years, others 3 years (Ravitch, 2010, pp. 181 ff.). Where do these numbers come from? If test score gains are calculated for every student (just this year's score minus last year's score) and then averaged up to the teacher level, an average test score gain can be obtained for each teacher. (Actual procedures are more complicated, but this will work as a first approximation.) Next, the one fifth of the teachers with the highest average gains can be compared to the one fifth with the lowest gains. The gap between the means for those two groups may be termed the effect of having a top quintile teacher versus a bottom quintile teacher. Suppose that comes out to 5 percentile points. If the Black-White achievement gap is 25 percentile points, then one could claim that if a student got a 5-point boost each year for 5 years in a row, that would be the size of the gap. This sounds good, but there are at least three reasons why such claims may be exaggerated.

Measurement Error

Number one, it is not certain who those top quintile teachers really are. Teacher value-added scores are unreliable. As will be shown, that means the teachers whose students show the biggest gains one year are often not the same as those whose students show big gains the next year. Statistical models can do much better than chance at predicting which teachers' students will show above-average gains, but these predictions will still be wrong much of the time. If one cannot be confident about which teachers are the top performers, then the full benefit implied by the logic of the top quintile/bottom quintile argument cannot be realized.
Measurement error will lead to unrealistically large teacher-effect estimates if the very same student test scores used to calculate teacher value-added are then used again to estimate the size of the teacher effect. This incorrect procedure amounts to a circular argument, whereby highly effective teachers are defined as those producing high student test score gains and those same students' test score gains are then attributed to their having been assigned to highly effective teachers. If a study instead classifies teachers into quintile groups based on their students' performance one year and then examines the performance of different students assigned to those teachers in a later year, the estimated quintile effect should correctly incorporate the effects of measurement error.[3]

Perhaps the first top quintile claim to attract widespread media attention was a study by Sanders and Rivers (1996). Using data from two urban school districts in Tennessee, these authors predicted a 50 percentile point difference between students assigned to top quintile versus bottom quintile teachers for 3 years in a row. Although the description of their statistical model is incomplete, it appears that measurement error may have led to an inflated estimate in this study, and that their finding was probably overstated (Kupermintz, Shepard, & Linn, 2001).

[3] Because teachers' estimated value-added scores always include some measurement error, teachers classified as top quintile or bottom quintile are not truly the most or the least effective. Random error causes some mixing of less effective teachers into the top group and more effective teachers into the bottom group. Thus, teachers classified as top quintile or bottom quintile do not all truly belong in those respective groups, and the effect estimated on the basis of teacher classifications will be smaller than the hypothetical effect attributable to their (unknown) true status.
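
The measurement-error argument can be illustrated with a small simulation. This is only a sketch under assumed conditions, not a reconstruction of any study cited here: the number of teachers, the spread of true effects (`sd_true`), the size of the noise (`sd_noise`), and the function names are all hypothetical. Each simulated teacher has a fixed true effect, and each year's observed score is that effect plus independent noise.

```python
import random
import statistics


def quintile_gap(classify_by, outcome):
    """Mean outcome of the top fifth minus the bottom fifth of teachers,
    where teachers are ranked by the values in classify_by."""
    order = sorted(range(len(classify_by)), key=lambda i: classify_by[i])
    k = len(order) // 5
    bottom = [outcome[i] for i in order[:k]]
    top = [outcome[i] for i in order[-k:]]
    return statistics.mean(top) - statistics.mean(bottom)


def simulate(n_teachers=5000, sd_true=1.0, sd_noise=1.5, seed=0):
    """Hypothetical teachers: observed score = true effect + independent noise."""
    rng = random.Random(seed)
    true = [rng.gauss(0, sd_true) for _ in range(n_teachers)]
    year1 = [t + rng.gauss(0, sd_noise) for t in true]  # this year's students
    year2 = [t + rng.gauss(0, sd_noise) for t in true]  # different students, later year
    return {
        # Circular: classify on year 1, then measure the gap on the same year-1 scores.
        "circular": quintile_gap(year1, year1),
        # Classify on year 1, then measure the gap on the year-2 students.
        "new_students": quintile_gap(year1, year2),
        # Benchmark that is unknowable in practice: classify by the true effect.
        "true_status": quintile_gap(true, true),
    }


if __name__ == "__main__":
    for name, gap in simulate().items():
        print(f"{name:>12}: {gap:.2f}")
```

Under these assumptions the circular estimate comes out largest, the estimate based on new students comes out smallest, and the gap based on true status falls in between, which is the ordering described in the text and in footnote 3.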
Fade-Out

Problem number two has to do with the idea that one can simply add up gains across years to get a total effect. In fact, the effects of one year's teacher, for good or for ill, fade out in subsequent years. The effects of that wonderful third grade teacher will be much attenuated by the time a student reaches seventh grade. So, the cumulative effect of a string of exceptional teachers will be more than the single year effect, but considerably less than a simple summation would imply.

Implementation Challenge

Finally, problem number three is simply that there is no way to assign all of the top performing teachers to work with minority students or to replace the current teaching force with all top performers. The thought experiment cannot be translated into an actual policy.

Teacher effects do not fade out entirely, of course. In a recent study, Chetty, Friedman, and Rockoff (2011) estimated that about 30% of the teacher effect persists after 3 or 4 years, with little further decline thereafter. They report that this is generally consistent with earlier research, but they are able to provide more accurate and longer term estimates using an exceptionally large longitudinal data set. Their study is also exceptional in detecting elementary school teacher effects lasting even into young adulthood.
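
To make the fade-out argument concrete, here is a minimal sketch comparing simple summation with a cumulative effect that decays over time. The persistence schedule is an assumption made for illustration: only the floor of roughly 30% after 3 or 4 years comes from the estimate cited above, while the intermediate values (0.5 and 0.4), the function name, and the constant name are hypothetical. The 5-point yearly boost is the figure from the thought experiment.

```python
def cumulative_effect(yearly_boost, persistence):
    """Total boost at the end of a run of equally effective teachers.

    persistence[age] is the fraction of one teacher's effect still present
    `age` years after that teacher's year (persistence[0] is 1.0).
    """
    return sum(yearly_boost * p for p in persistence)


# Hypothetical schedule: only the 0.3 floor reflects the figure cited in the text;
# the intermediate values 0.5 and 0.4 are made up for illustration.
PERSISTENCE = [1.0, 0.5, 0.4, 0.3, 0.3]

if __name__ == "__main__":
    boost = 5  # percentile points per year, as in the thought experiment
    print("simple summation:", boost * len(PERSISTENCE))
    print("with fade-out:   ", cumulative_effect(boost, PERSISTENCE))
```

With this assumed schedule, five years of 5-point boosts add up to about half of the 25 points that simple summation implies: more than the single year effect, but considerably less than the sum.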
Calculating Test Score Gains

I glossed over another challenge in the top quintile/bottom quintile story when I began with test score gains calculated for each student by simply subtracting last year's score from this year's score. Measuring student achievement is not the same as measuring length or weight. The difference between 2 inches and 4 inches is the same as the difference between 6 inches and 8 inches. That is what is meant by an equal-interval scale. But, it is much harder to be sure that the difference between test scores of 20 and 40 is the same as the difference between scores of 60 and 80.

Notice I did not refer to getting 20 items right or 40 items right. Raw scores are pretty much hopeless for these purposes. Test developers use sophisticated statistical models to convert raw scores into scale scores with better statistical properties, but these scale scores are still far from perfect. What does equal interval mean in describing test score scales? Does it mean that on average, it takes the same amount of instructional time or teaching skill to boost a student's score from 20 to 40 as it does from 60 to 80? Probably not, actually. The short answer is that the meaning of equal interval varies according to the score scale's intended use or interpretation, and even for a specific intended use, whether or not a scale is equal interval cannot generally be determined.

So why does having an equal-interval scale matter? Let us say the score scale is not equal interval. To take just one possible example, let us suppose the units near the top of the scale are actually a little bit smaller than at the bottom of the scale. In that case, as shown in Figure 2, if two teachers' students start out at different score levels, on average, and if the teachers would in fact appear equally effective in raising student test scores on an equal-interval scale, then the measured gains for the students in the higher-performing classroom will appear larger.
A direct comparison of measured score gains for the two teachers will be unfair.[4]

Figure 2. Possible Consequences of a Nonlinear Test Score Scale. [Diagram: Measured Growth = 6 points; Measured Growth = 7 1/2 points.] A nonlinear scale means teachers are rewarded or penalized, depending on where their students start out. Especially problematical for teachers of students above or below grade level, or with special needs.

This is not just a hypothetical argument. Tests aligned to grade-level standards cannot fully register the academic progress of students far above grade level or far below grade level. If the test is too hard for the students, then they may make much progress and still score around the chance level. And if the test is too easy, students may get near-perfect scores on the pretest and not do much better when they are tested again a year later. That translates into bias against those teachers working with the lowest-performing or the highest-performing classes.[5] If tests have an inadequate range of content and difficulty, then bias against some teachers is likely.

[4] The statistical challenge is even greater when the equal-interval scale has to span multiple grade levels. If students' gains are calculated by subtracting prior year test scores from current year scores, then these gains are probably comparisons between scores on two different tests, built to measure different grade-level content standards. Calculating gain scores across years requires something called a vertical scale. If tests are not vertically scaled across grades, VAMs cannot rely on gain scores and must instead incorporate prior year test scores in much the same way as any other predictor. This is satisfactory, but to the extent that prior year scores and current-year scores measure different constructs, accuracy will suffer. Note that the equal-interval scale assumption is important whether or not a vertical scale is assumed.

[5] Teachers of high-performing students typically earn above average value-added scores, but anecdotal reports suggest that teachers of gifted and talented classrooms may be penalized because tests are too easy to measure their students' progress (Amrein-Beardsley & Collins, 2012).
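
The scale argument illustrated by Figure 2 can be sketched in a few lines. The mapping below is purely hypothetical: it assumes an underlying equal-interval achievement level and a reported scale whose units are smaller above an arbitrary point (`knee=50`), so that each true unit counts as 1.25 reported units there. Those two parameters and the function names are assumptions, chosen only so that the output reproduces the 6-point and 7 1/2-point figures shown in Figure 2.

```python
def reported_score(true_level, knee=50.0, stretch=1.25):
    """Map a hypothetical equal-interval achievement level to a reported scale score.

    Above `knee`, each true unit counts as `stretch` reported units, i.e. the
    reported units are smaller near the top of the scale.
    """
    if true_level <= knee:
        return true_level
    return knee + stretch * (true_level - knee)


def measured_gain(start, true_gain):
    """Gain on the reported scale for a class that truly grows by true_gain."""
    return reported_score(start + true_gain) - reported_score(start)


if __name__ == "__main__":
    print(measured_gain(20, 6))  # classroom starting low on the scale
    print(measured_gain(60, 6))  # classroom starting high on the scale
```

Both classrooms grow by the same 6 true units, yet the classroom that starts higher is credited with a larger measured gain. Nothing about the two teachers differs; only the scale does.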
The Logic of Value-Added Models

In examining the logic of VAMs, it is helpful to begin by considering briefly what is wrong with evaluating teachers just by comparing their students' average test scores at the end of the year. Seeing the obvious flaw in that approach should help to clarify the problem the VAM has to solve.

Let us begin with something familiar. Think about a typical testing situation, where each student gets a test consisting of a collection of items. The student answers the items; the answers are scored; the item scores are summed to get a test score; and finally, different students' test scores are compared to see who ranks high and who ranks low.

Next, apply this template to the problem of measuring teacher effectiveness. The comparison is shown in Table 1. This time, think about a testing situation where the examinees are teachers, not students, and where each test item, if you will, is actually a student. The way these student-items are administered to the teacher-examinees is by having the teacher teach the student for a year. The way the student-items are scored is by giving each student an achievement test at the end of the year. The way the student-item scores are summarized is by averaging the students' test scores within each classroom. Then, the teachers are compared to one another based on these averages.

Now, one can see right away that this is not going to work very well because some teachers will get students who are easier to teach or who know more at the start of the year compared to other teachers. If the group of students in a teacher's classroom for a year is like a test for that teacher, then one might say that some teachers are handed much easier tests, and others are handed much harder tests.

So to make the teacher comparisons fairer, one has to adjust for these student differences. This is done by estimating what score each student would have earned, on average, if that student had been taught all year by any other teacher.
Then, by comparing the student's actual end-of-year score to this estimated score average across all possible teachers, one can adjust for those differences in the students assigned to different teachers. The starting premise is that each student spent the year being taught by one particular teacher. The end-of-year scores that would have been observed if that student had instead been taught by some other teacher are each referred to as counterfactuals: the hypothesized outcomes of events that did not actually happen. The student's average score across all these potential, counterfactual teacher assignments is used as the point of comparison for judging the actual score obtained after the student has spent the year with a particular teacher.

Table 1. Test Scores for Students Versus Value-Added Model (VAM) Scores for Teachers

Aspect of testing situation   Typical test                              Simplified teacher VAM
Examinees                     Students                                  Teachers
Items                         Test questions                            Students
Test                          Items in a test form                      Students in a classroom
Administration                Student answers items                     Teacher teaches students
Item scoring                  Item responses scored according to key    Student learning scored by giving each student a standardized test
Test score                    Sum of item scores                        Average of student test scores

Once these counterfactual scores are estimated for each student, one can see whether each student actually performed as well as, better than, or worse than predicted. The estimated average score is subtracted from the observed score, so that a positive difference means better than expected; a negative difference means worse than expected. Then these differences are averaged up to the teacher level.[6]

So how does one estimate how well a given student would have scored after spending the year in some other teacher's classroom? One looks for students similar to that given student and assumes that the average observed score for those other students, obtained after their respective years of instruction with various other teachers, gives a good estimate of the average counterfactual score for the given student. Various kinds of information about students can be used in deciding what similar means here.

Now this is not the way VAMs are typically described. In practice, to carry out this process of estimating average counterfactual scores for each student, one makes strong statistical assumptions about the functional form of the relationships among various observable student characteristics and achievement test scores (whether relationships between variables are best characterized as linear, for example, or with some more complicated mathematical function). Then a technique called regression analysis is used to carry out the estimation for all students at once. The process is often described in the convenient shorthand of controlling for or adjusting for various factors. That language is perfectly fine, of course, but may make it too easy to ignore the underlying logic of the estimation and the strong assumptions the regression model actually entails.

Some big differences among various VAMs stem from their choices as to what information to use in controlling or adjusting for student differences. Prior year test scores are included, because these are among the most powerful predictors of current-year test scores. Students who scored high last year are likely, on average, to score high again this year. Of course, just looking at last year's test score is not enough. VAMs that reach back further in time, including test scores from 2 years earlier as well as from the previous year, are considerably more accurate. Some models just use prior scores from the same subject area, while others pull in test scores from different subject areas. In addition to test scores, some models use students' absences, suspensions, grade retentions, English learner or special education status, or summer school attendance. Some models may include gender or other demographic variables describing students. Models may include the average scores of other students in the same classroom or the average score for the entire school or district. All of these choices influence the resulting estimates of how well each individual student would have fared, averaging across all possible teacher assignments.

[6] There is an additional technicality in this averaging, which is not of concern here. Because teachers with fewer students are more likely to get extreme value-added estimates just by chance, some models adjust for the amount of information available about each teacher using so-called shrinkage estimators to make extreme scores less likely. This is another modeling decision that influences the outcomes. Different models give different answers.
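
The logic described above (predict each student's score, subtract the prediction from the observed score, and average the differences by teacher) can be sketched as follows. This is a deliberately stripped-down illustration, not any operational VAM: it assumes a single predictor (the prior year score), a straight-line relationship fitted by ordinary least squares across all students, and a crude shrinkage factor n / (n + shrink_k) standing in for the shrinkage estimators mentioned in footnote 6. The function names, the `shrink_k` parameter, and the example records are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def simple_vam(records, shrink_k=0.0):
    """records: iterable of (teacher, prior_score, current_score).

    Returns each teacher's mean residual (observed minus predicted score),
    multiplied by n / (n + shrink_k). With shrink_k=0 there is no shrinkage.
    """
    records = list(records)
    intercept, slope = fit_line([r[1] for r in records], [r[2] for r in records])
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))
    return {t: len(r) / (len(r) + shrink_k) * mean(r) for t, r in residuals.items()}


# Made-up data: teacher B's students start and finish higher, but gain less.
EXAMPLE = [
    ("A", 30, 38), ("A", 40, 48), ("A", 50, 58),
    ("B", 60, 62), ("B", 70, 72), ("B", 80, 82),
]

if __name__ == "__main__":
    print("raw end-of-year means:",
          {t: mean(c for tt, _, c in EXAMPLE if tt == t) for t in ("A", "B")})
    print("value-added sketch:   ", simple_vam(EXAMPLE))
    print("with shrinkage (k=3): ", simple_vam(EXAMPLE, shrink_k=3))
```

In the made-up data, teacher B has the higher end-of-year average, but teacher A's students score above their predicted values, so the ranking reverses once prior scores are taken into account. Changing the predictors, the functional form, or the amount of shrinkage would change the resulting scores, which is the point made above: different models give different answers.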
Briggs and Domingue (2011) reanalyzed the data used to generate the teacher effectiveness estimates published by the Los Angeles Times in August of 2010. Here is what they said about the statistical model used in the analyses published by the LA Times:

    The term value-added is intended to have the same meaning as the term causal effect; that is, to speak of estimating the value-added by a teacher is to speak of estimating the causal effect of that teacher. But once stripped of the Greek symbols and statistical jargon, what we have left is a remarkably simple model that we will refer to as the LAVAM (Los Angeles Value-Added Model). It is a model which, in essence, claims that once we take into account five pieces of information about a student, the student's assignment to any teacher in any grade and year can be regarded as occurring at random. If that claim is accurate, the remaining differences can be said to be the value added or subtracted by that particular teacher. (Briggs & Domingue, 2011, p. 4)

The five pieces of information in the LAVAM were test performance in the previous year, gender, English language proficiency, eligibility for Title I services, and whether the student began schooling in the LA Unified School District after kindergarten. In effect, the LAVAM relies on these five variables to account for all the systematic differences among the students assigned to different teachers. My point here is not that this particular model is a bad one because it only includes five variables, although Briggs and Domingue (2011) did show that teacher rankings changed substantially when an alternative model with some additional control variables was used. (They interpreted their findings as showing that the alternative model had less bias.)
The key point here is to understand how VAMs work: They adjust for some set of student characteristics, and sometimes for certain classroom or school characteristics, and then assume that once those adjustments are made, student assignments to teachers are as good as random.

Stated a little differently, the goal for the VAM is to strip away just those student differences that are outside of the current teacher's control (those things the teacher should not be held accountable for), leaving just those student test score influences the teacher is able to control and therefore should be held accountable for. This is a sensitive business, and different, defensible choices can lead to substantial differences in teachers' value-added rankings.

Earlier I offered an analogy of teacher value-added estimation being like a testing process, in which the teachers are the examinees and the classrooms full of students are like collections of items on different forms of a test. Before leaving that analogy, let me also point out that in any testing situation, common notions of fairness require that all examinees take the test under the same testing conditions. Unlike standardized testing conditions, in the VAM scenario the teacher-examinees may be working under far from equal conditions as they complete their value-added tests by teaching their students for a year. School climate and resources, teacher peer support, and, of course, the additional instructional support and encouragement students receive both out of school and from other school staff all make the test of teaching much easier for teachers in some schools and harder in others.
Statistical Assumptions

VAMs are complicated, but not nearly so complicated as the reality they are intended to represent. Any feasible VAM must rely on simplifying assumptions, and violations of these assumptions may increase bias or reduce precision of the model's value-added estimates. Violations of model assumptions also make it more difficult to quantify just how accurate or inaccurate those estimates really are. Hence, these statistical assumptions matter.

Effects of Social Stratification

Recall that the fundamental challenge is to estimate the average of each student's potential scores across all possible teachers. This is difficult due in part to the socioeconomic stratification in the U.S. school system. Reardon and Raudenbush (2009) pointed out that, given the reality of school segregation on the basis of various demographic characteristics of students, including family socioeconomic background, ethnicity, linguistic background, and prior achievement, in practice, some students [may] have no access to certain schools (p. 494). If teachers in some schools have virtually no access to high-achieving students from affluent families, and teachers in other schools have similarly limited access to low-achieving students from poor families, then the statistical model is forced to project well beyond the available data in order to estimate potential scores on a common scale for each student with each teacher. For this reason, VAM estimates are least trustworthy when they are used to compare teachers working in very different schools or with very different student populations.

Peer Effects

Another key assumption holds that a given student's outcome with a given teacher does not depend upon which other students are assigned to that same teacher. This is sometimes stated as "no peer effects."7 One usually thinks about peer effects as arising when students interact with each other. There are peer effects when small groups of students work collaboratively, for example.
Or, peer effects are thought of as arising through peer culture, that is, whether students reinforce or discourage one another's academic efforts. These kinds of effects are important, of course, but for value-added modeling, there are two additional kinds of peer effects that may be equally or more important.

The first of these has to do with how the members of the class collectively influence the teacher's pacing of instruction, the level at which explanations are pitched, the amount of reading assigned, and so forth. If the teacher is meeting the students where they are, then the average achievement level in the class as a whole is going to influence the amount of content delivered to all of the students over the course of the school year. In the real world of schooling, students are sorted by background and achievement through patterns of residential segregation, and they may also be grouped or tracked within schools. Ignoring this fact is likely to result in penalizing teachers of low-performing students and favoring teachers of high-performing students, just because the teachers of low-performing students cannot go as fast.

7 Technically, no peer effects is an implication of the stable unit treatment value assumption (SUTVA).
Yet another kind of peer effect arises when some students in the classroom directly promote or disrupt the learning of others. Just about every teacher can recall some classes where the chemistry was right: perhaps one or two strong students always seemed to ask just the right question at just the right time to move the classroom discussion along. Most teachers can also recall some classes where things did not go so well. Perhaps one or two students were highly disruptive or repeatedly pulled the classroom discussion off topic, wasting precious minutes before the teacher could get the lesson back on track.8

Simply put, the net result of these peer effects is that VAMs will not simply reward or penalize teachers according to how well or poorly they teach. They will also reward or penalize teachers according to which students they teach and which schools they teach in. Some of these peer effects (e.g., disruptive students) may add random noise to VAM estimates. Others (e.g., effect of average achievement level on pacing) may introduce bias.9 Adjusting for individual students' prior test scores and other background characteristics may mitigate but cannot eliminate this problem.

8 It is, of course, the teacher's responsibility to manage disruptive students, but the fact remains that teacher time spent dealing with such classroom disruptions may affect the learning of all students in the classroom.

9 Some, but not all, VAMs incorporate classroom- or school-level measures to help control for these kinds of systematic effects.
An Interpretive Argument for Value-Added Model (VAM) Teacher Effectiveness Estimates

I suggested earlier that one might think of teacher value-added effectiveness estimates as a complicated kind of test score. Teachers are the examinees; each student is like a test item. Assigning a classroom full of students to a teacher for a year is like giving the teacher a test composed of 30 or so items. Thinking about the entire series of steps involved in value-added estimation as a single, complicated measurement process, one can consider the validity of VAM scores for any given purpose in much the same way as a testing expert would consider the validity of any other score. An interpretive argument is needed: a logical sequence of propositions that, taken together, make the case for the proposed use or interpretation. Then, once there is an interpretive argument, the strength of the evidence supporting each proposition must be considered.

Perhaps the most authoritative contemporary treatment of test validation is provided by Michael Kane's (2006) chapter, Validation, in the most recent edition of Educational Measurement. His analysis laid out four broad steps in the interpretive argument (Kane, 2006, p. 34). My application of this framework to teacher VAM score estimation is shown in Table 2.

Table 2. An Interpretive Argument for Teacher Value-Added Model (VAM) Scores

1. Scoring (observed score)
Description: Construction of observed VAM score for an individual teacher.
Focusing question: Is the score unbiased? (i.e., is systematic error acceptably small?)

2. Generalization (observed score to universe score)
Description: Generalization to scores that might have been obtained with a different group of students or a parallel form of the same test.
Focusing question: Is the score reliable? (i.e., is random error acceptably small?)

3. Extrapolation (universe score to target score)
Description: Extrapolation to teacher effectiveness more broadly construed.
Focusing questions: Do scores correlate with other kinds of indicators of teaching quality? Do teacher rankings depend heavily on the particular test used? Does achievement test content fully capture valued learning outcomes? How do VAM scores relate to valued nontest (noncognitive) outcomes?

4. Implication (target score to interpretation or decision)
Description: Soundness of the intended decision or interpretation.
Focusing questions: Are intended benefits likely to be realized? Have plausible unintended consequences been considered?

The first step is scoring. Here the scoring proposition holds that teacher VAM scores accurately capture each teacher's effectiveness, with the particular group of students that teacher actually taught, as measured by the student achievement test actually administered. In other words, each teacher's VAM score captures that teacher's degree of success in imparting the knowledge and skills measured by the student achievement test, reasonably undistorted by irrelevant factors. Scoring is the step from the teacher's classroom performance to the teacher's VAM score.

The second step is generalization, which addresses test reliability. One needs to know how stable VAM scores would be across different possible classes a
teacher might have taught and also over time. If this year's VAM score gives poor guidance as to a teacher's likely effectiveness next year, then it is not very useful. In the language of test theory, this is the step from the observed score to the universe score, the long-run average across imagined repeated measurements.

The third step is extrapolation, which directs attention to the relation between the student achievement test actually used and other tests that might have been used instead for capturing student learning outcomes. It also covers the broader question of how well students' scores on this test or similar tests can capture the full range of important schooling outcomes. The real target of any measurement is some quality that is broader than test taking per se. In Kane's (2006) terminology, extrapolation is the move from the universe score to that target score.

Finally, the fourth step is implication, which directs attention to rationales for the expected benefits of each particular score use or interpretation, as well as plausible unintended consequences. This is the step from the target score to some decision or verbal description.

Let us next turn to some of the evidence concerning each of these steps. Scoring will address issues of bias, or systematic error. Generalization will address reliability, or random error. Extrapolation will address the relation between teacher VAM scores and other measures of effectiveness.10 Implication, finally, will take up the question of appropriate and inappropriate uses of VAM scores and their likely consequences.

Scoring

Recall that the scoring step holds that a teacher's VAM estimate really does tell how effective that teacher was, this year, with these students, in teaching the content measured by this particular achievement test. This means the scoring must be free of systematic bias, the statistical model must reflect reality, and the data must fit the model.
The word bias is used in a statistical sense, although here the commonsense meaning of the term is not too far off. Bias refers to errors that do not average out as more information is collected. If teachers in some kinds of schools, or working with some kinds of students, or teaching in certain grades or subject areas tend to get systematically lower or higher VAM estimates, that kind of error will not average out in the long run. The error will tend to show up again and again for a given teacher, in the same direction, year after year, simply because teachers tend to work with similar students year after year, typically in the same or similar schools.

10 Another important dimension of extrapolation is related to the assumption that a teacher's effectiveness with one sort of students is predictive of that teacher's effectiveness with different sorts of students. The assumption that a teacher has some effectiveness independent of the kinds of students that teacher is working with is important, but largely unexamined.

Let us consider this question of bias. Jesse Rothstein (2010) published an important paper in which he developed and applied a falsification test for each of three different VAM specifications. Rothstein argued that it is logically impossible for current teacher assignments to influence students' test score gains in earlier years. This year's teacher cannot influence last year's achievement. Therefore, if a VAM is run backward in time, using current teacher assignments to predict students' score gains in earlier years, it ought to show that the true variance of prior year teacher effects, discounting random error, is near zero. This is called a falsification test because if the analysis does estimate substantial variance for prior
year teacher effects, then those estimates have to be biased. Such a finding strongly suggests that current-year teacher effect estimates may also be biased, although it does not prove the existence of bias.11

Rothstein (2010) tried this out using data from fifth grade classrooms in North Carolina. His sample included more than 60,000 students in more than 3,000 classrooms in 868 schools. He tried several different VAMs and consistently found that fifth grade teacher assignments showed powerful effects on third to fourth grade test score gains.

Briggs and Domingue (2011) used Rothstein's test to look at the data on teachers from the LA Unified School District, the same data set Richard Buddin used to estimate the first round of teacher value-added scores published by the Los Angeles Times in August. On the reading test, they found that teachers' estimated effects on their students' gains during a previous school year were about as large as their estimated effects on score gains during the current year. On a mathematics test, the logically impossible prior year effects came out around two thirds as large as for the current year. In one comparison, the estimated effects of fourth grade teachers on third grade reading gains were slightly larger than those teachers' estimated effects on fourth grade reading gains. Similar findings have emerged in other studies.

How can this be? As stated earlier, one reason is the massively nonrandom grouping of students, both within and between schools, as a function of family socioeconomic background and other factors. This clearly has the potential to distort teacher effectiveness estimates coming out of VAMs. Nonrandom assignment might also take the form of assigning struggling readers to reading specialists or English learners to bilingual teachers.

Bias is also possible due to differences in the schools where teachers work. Not all schools are equally conducive to student learning.
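The logic of the falsification test described above can be sketched with a small simulation. This is not Rothstein's (2010) specification, which uses regression models with controls. It is a deliberately simplified illustration in which the function name `excess_between_class_variance`, the class sizes, and the tracking mechanism are all assumptions. Prior-year gains are generated before any classroom is formed, so this year's teachers cannot have caused them. The question is whether current classroom assignment nonetheless appears to "explain" them.

```python
import numpy as np

def excess_between_class_variance(values, class_ids):
    """Variance of classroom means minus the part expected from sampling
    noise alone (a rough stand-in for the 'true variance' of class effects)."""
    classes = np.unique(class_ids)
    means = np.array([values[class_ids == c].mean() for c in classes])
    within = np.mean([values[class_ids == c].var(ddof=1) for c in classes])
    sizes = np.array([(class_ids == c).sum() for c in classes])
    return means.var(ddof=1) - within * np.mean(1.0 / sizes)

rng = np.random.default_rng(1)
n_classes, class_size = 100, 25
n = n_classes * class_size

# Last year's score gains: fixed before this year's classrooms exist.
prior_gain = rng.normal(0.0, 1.0, n)

# Scenario A: students are assigned to this year's classrooms at random.
random_ids = rng.permutation(np.repeat(np.arange(n_classes), class_size))

# Scenario B (assumed tracking rule): students are grouped by a noisy
# signal of last year's gain, as might happen with within-school tracking.
order = np.argsort(prior_gain + rng.normal(0.0, 1.0, n))
tracked_ids = np.empty(n, dtype=int)
tracked_ids[order] = np.repeat(np.arange(n_classes), class_size)

v_random = excess_between_class_variance(prior_gain, random_ids)
v_tracked = excess_between_class_variance(prior_gain, tracked_ids)
print(f"apparent prior-year 'teacher effect' variance, random assignment:  {v_random:.3f}")
print(f"apparent prior-year 'teacher effect' variance, tracked assignment: {v_tracked:.3f}")
```

Under random assignment the apparent prior-year variance is close to zero, as the test expects. Under the assumed tracking rule it is large, even though no teacher in the simulation has any effect at all, which is the signature of nonrandom sorting that the falsification test is designed to detect.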
Bias may come about because peer effects are not fully accounted for. Some limited evidence suggests that bias in VAMs may not be a serious problem (e.g., Chetty et al., 2011; Kane, McCaffrey, Miller, & Staiger, 2013). However, like all studies, each of these has some weaknesses and limitations.12 Moreover, the fact that no bias is detected in one VAM application is no guarantee that bias may not exist in some other setting.

Another significant concern arises because the student achievement tests often used to date have been those mandated by NCLB (2002), which by law are limited to testing content at grade level. That means that teachers of gifted and talented classes may be unable to earn high value-added scores because their above-grade-level students are topping out on the tests and simply cannot demonstrate any further score gains. Likewise, teachers whose students are far below grade level may be penalized because the content they are teaching to meet their students' needs does not show up on the tests used to measure student growth.

Yet another potential source of bias is related to summer learning loss (see Figure 3). Jennifer Sloan McCombs and her colleagues at the RAND Corporation (McCombs et al., 2011) recently reviewed the research on summer learning loss. They concluded that on average, elementary school students lose about 1 month of learning over the summer months, from spring to fall. Losses are somewhat larger for mathematics, somewhat smaller for reading. But more importantly, these losses are not the same for all students. On average, students from higher income families actually post gains in reading achievement over the summer months, while their peers from lower income families post losses. This suggests a potential distortion in comparisons of VAM estimates among teachers whose students come from different economic backgrounds. On average, reading scores from the previous spring will underestimate the initial autumn proficiency of students in more advantaged classrooms and overestimate the initial autumn proficiency of those in less advantaged classrooms. Even if the two groups of students in fact make equal fall-to-spring gains, their measured prior spring-to-spring gains may differ. Some of this difference may be accounted for in VAMs that include adjustments for demographic factors, but once again, it appears likely that value-added estimates may be biased in favor of some teachers and against others.

Figure 3. Summer Learning Loss Is Not the Same for Students From Less Affluent Versus More Affluent Families
Measured spring-to-spring test score gain = spring-to-fall (summer) loss or gain + fall-to-spring (school year) gain
Low income families: summer learning loss; spring-to-spring gain understates school year gain
High income families: summer learning gain in reading; spring-to-spring gain overstates school year gain

These concerns must be balanced against compelling empirical evidence that teacher VAM scores are capturing some important elements of teaching quality. In particular, Chetty et al. (2011) recently reported that teachers' VAM scores predicted their students' future college attendance, earnings, socioeconomic status, and even teenage pregnancy rates.13 Their study included creative statistical tests for bias due to omitted variables, and they found no bias. Similarly, Goldhaber and Hansen (2010) have reported modest but statistically significant effects of teacher VAM estimates on student test scores several years later. Teacher VAM scores are certainly not just random noise. These models appear to capture important differences in teachers' effects on student learning outcomes. But even the best models are not pure measures of teacher effectiveness. VAM scores do predict important student learning outcomes, but my reading of the evidence strongly suggests that these scores nonetheless measure not only how well teachers teach, but also whom and where they teach.

11 Goldhaber and Chaplin (2012) analyzed the conditions under which it is possible for one of Rothstein's specifications to yield a non-null finding even if current-year effect estimates are unbiased and called for further investigation. Chetty et al. (2011) implemented a quasi-experimental test for selection on unobservables, based on teacher switching between schools, and also concluded that, although they replicated Rothstein's results, this does not in fact imply that their estimates of long-term teacher effects are biased.

12 The Chetty et al. (2011) study relied on student test data collected under relatively low-stakes conditions, which limits its applicability to VAMs with high stakes for teachers. The MET Project randomization study by Kane et al. (2013) examined random student assignment under rather constrained conditions and also suffered from problems of attrition and noncompliance. These problems limited its power to detect bias due to student assignment.

13 The study by Chetty et al. (2011) is very carefully done, but relied on data collected in a context in which no particularly high stakes were attached to student test scores. Even in that context, the authors set aside the top 2% of teacher VAM scores because these teachers' impacts on test scores appear suspiciously consistent with testing irregularities indicative of cheating (Chetty et al., 2011, p. 23). When these teachers were included in the analysis, estimated long-term teacher effects were reduced by roughly 20% to 40%.
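Returning to the summer learning decomposition in Figure 3, a short arithmetic sketch makes the potential distortion concrete. The function name and every number below are hypothetical, chosen only to match the direction of the findings summarized above (a summer loss for one group, a summer gain in reading for the other). Units are "months of learning."

```python
def spring_to_spring_gain(summer_change, school_year_gain):
    """Measured spring-to-spring gain = summer change + school-year gain."""
    return summer_change + school_year_gain

# Assumption: both classrooms make the same true fall-to-spring gain.
school_year_gain = 9.0

# Hypothetical summer changes (negative = loss, positive = gain).
low_income_measured = spring_to_spring_gain(-1.0, school_year_gain)
high_income_measured = spring_to_spring_gain(+0.5, school_year_gain)

print("true school-year gain (both groups):", school_year_gain)
print("measured gain, lower income classroom: ", low_income_measured)
print("measured gain, higher income classroom:", high_income_measured)
print("gap attributable to summer alone:", high_income_measured - low_income_measured)
```

In this illustration the two teachers are equally effective during the school year, yet a model that relies on spring-to-spring gains would see a difference between them that arises entirely over the summer.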
Generalization

The second link in the chain of propositions needed to support VAM scores is generalization, the step from observed score to universe score. The first proposition, scoring, focused on the question of what value-added scores were measuring, including the question of whether those scores were free from systematic bias. Generalization shifts attention from "what" to "how well" and from systematic error to random error. It focuses on the question of how stable or unstable teacher VAM scores turn out to be. This is the familiar issue of score reliability.

One very good way to estimate reliability is just to correlate value-added scores from two points in time, or from two sections of the same class. The correlation itself is the same as a reliability coefficient. Several years ago, Daniel McCaffrey and his coauthors investigated a variety of VAM specifications and data sets and found year-to-year correlations mostly between .2 and .4, with a few lower and a few higher (McCaffrey, Sass, Lockwood, & Mihaly, 2009). More specifically, they looked at value-added scores for teachers in five different counties in Florida. Figure 4 illustrates some of their findings for elementary school teachers. They found that in each county, a minimum of 10% of the teachers in the bottom fifth of the distribution one year were in the top fifth the next year, and conversely.
Typically, only about a third of one year's top performers were in the top category again the following year, and likewise, only about a third of one year's lowest performers were in the lowest category again the following year. These findings are typical. A few studies have found reliabilities around .5 or a little higher (e.g., Koedel & Betts, 2007), but this still says that only half the variation in these value-added estimates is signal, and the remainder is noise.

Figure 4. Year-to-Year Changes in Teacher Value-Added Rankings Reported by McCaffrey et al. (2009, Table 4, p. 591)
[Two bar-chart panels: Next-Year Distribution of One Year's Bottom Quintile Elementary Teachers, in Five Florida Counties, and Next-Year Distribution of One Year's Top Quintile Elementary Teachers, in Five Florida Counties. Vertical axis: percent of teachers. Horizontal axis: next-year quintile, from bottom quintile to top quintile. Counties shown: Dade, Duval, Hillsborough, Orange, Palm Beach.]

McCaffrey and his colleagues (2009) pointed out that year-to-year changes in teachers' scores reflected both the vagaries of student sampling and actual changes in teachers' effectiveness from year to year. But if one wants to know how useful one year's score is for predicting the next year's score, that distinction does not matter. McCaffrey et al.'s results imply that unstable or random components together account for more than half the variability in VAM scores, and in some cases as much as 80% or more. Sorting teachers according to single-year value-added scores is sorting mostly on noise.
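The link between a year-to-year correlation and quintile churn can be explored with a brief simulation. This is a sketch under stated assumptions, not a reanalysis of McCaffrey et al. (2009): the function name `quintile_transitions` is hypothetical, scores are assumed to be normally distributed, and each year's score is modeled as a stable component plus independent yearly noise, weighted so that the expected year-to-year correlation equals the chosen reliability. The model lumps real changes in effectiveness together with sampling noise, which, as noted above, does not matter for prediction.

```python
import numpy as np

def quintile_transitions(reliability, n_teachers=100_000, seed=0):
    """Simulate two years of scores whose expected correlation equals
    `reliability`, then track where year-1 bottom-quintile teachers land."""
    rng = np.random.default_rng(seed)
    stable = rng.normal(size=n_teachers)                   # persistent component
    w_signal, w_noise = np.sqrt(reliability), np.sqrt(1.0 - reliability)
    year1 = w_signal * stable + w_noise * rng.normal(size=n_teachers)
    year2 = w_signal * stable + w_noise * rng.normal(size=n_teachers)

    lo1 = np.quantile(year1, 0.2)                          # year-1 bottom-quintile cutoff
    lo2, hi2 = np.quantile(year2, [0.2, 0.8])              # year-2 quintile cutoffs
    bottom1 = year1 <= lo1
    return {
        "correlation": np.corrcoef(year1, year2)[0, 1],
        "stay_bottom": np.mean(year2[bottom1] <= lo2),     # bottom again next year
        "bottom_to_top": np.mean(year2[bottom1] >= hi2),   # bottom to top fifth
    }

result = quintile_transitions(0.3)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With a reliability in the middle of the .2 to .4 range reported above, the simulated churn is broadly in line with the pattern described in the text: a minority of bottom-quintile teachers remain in the bottom quintile the following year, and a nontrivial share move all the way to the top quintile.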
More informationSchools Valueadded Information System Technical Manual
Schools Valueadded Information System Technical Manual Quality Assurance & Schoolbased Support Division Education Bureau 2015 Contents Unit 1 Overview... 1 Unit 2 The Concept of VA... 2 Unit 3 Control
More informationCALIFORNIA S TEACHING PERFORMANCE EXPECTATIONS (TPE)
CALIFORNIA S TEACHING PERFORMANCE EXPECTATIONS (TPE) The Teaching Performance Expectations describe the set of knowledge, skills, and abilities that California expects of each candidate for a Multiple
More informationAustralians get fail mark on what works to improve schools
REVOLUTION SCHOOL Summary of Survey and Research Australians get fail mark on what works to improve schools A significant number of Australians wrongly believe that smaller class sizes, compulsory homework
More informationGetting the Most from Demographics: Things to Consider for Powerful Market Analysis
Getting the Most from Demographics: Things to Consider for Powerful Market Analysis Charles J. Schwartz Principal, Intelligent Analytical Services Demographic analysis has become a fact of life in market
More informationTitle: Transforming a traditional lecturebased course to online and hybrid models of learning
Title: Transforming a traditional lecturebased course to online and hybrid models of learning Author: Susan Marshall, Lecturer, Psychology Department, Dole Human Development Center, University of Kansas.
More informationAbstract Title: Identifying and measuring factors related to student learning: the promise and pitfalls of teacher instructional logs
Abstract Title: Identifying and measuring factors related to student learning: the promise and pitfalls of teacher instructional logs MSP Project Name: Assessing Teacher Learning About Science Teaching
More informationSection 7: The FiveStep Process for Accommodations for English Language Learners (ELLs)
: The FiveStep Process for Accommodations for English Language Learners (ELLs) Step 1: Setting Expectations Expect English Language Learners (ELLs) to Achieve Gradelevel Academic Content Standards Federal
More informationNCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
More informationThis year, for the first time, every state is required
What New AYP Information Tells Us About Schools, States, and Public Education By Daria Hall, Ross Wiener, and Kevin Carey The Education Trust This year, for the first time, every state is required to identify
More informationA Guide to Curriculum Development: Purposes, Practices, Procedures
A Guide to Curriculum Development: Purposes, Practices, Procedures The purpose of this guide is to provide some general instructions to school districts as staff begin to develop or revise their curriculum
More informationTOOL KIT for RESIDENT EDUCATOR and MENT OR MOVES
Get to Know My RE Observe Collect Evidence Mentor Moments Reflect Review Respond Tailor Support Provide Provide specific feedback specific Feedback What does my RE need? Practice Habits Of Mind Share Data
More informationHow to Write a Successful PhD Dissertation Proposal