Value-Added Measures of Educator Performance: Clearing Away the Smoke and Mirrors


Value-Added Measures of Educator Performance: Clearing Away the Smoke and Mirrors
(Book forthcoming, Harvard Education Press, February 2011)
Douglas N. Harris, Associate Professor of Educational Policy and Public Affairs, University of Wisconsin at Madison
October 19, 2010, SERVE Southeast REL Webinar

Preview
- Discuss how we measure (or really fail to measure) teacher performance today
- Explain what value-added measures are and how they might improve performance measurement
- Discuss how well value-added measures capture teacher performance: the different types of errors
- Interpret research evidence about the errors
- Provide a sense of perspective, as well as some specific recommendations, about how to use value-added measures

A Question for All Organizations
How should we measure and reward performance?
- What if we only measure performance related to one organizational goal and omit other goals?
- What happens if we measure performance badly for any or all goals?
- How do we align the incentives of workers with those of the organization, using imperfect measures?
Specific concerns in schools:
- Many goals to balance
- Need for professionalism
- Desire to keep politics out

Rationale for Value-Added

The Traditional Credentials Strategy for Teacher Quality
- Until the 1990s, the education system focused on rule compliance and resources: finance, class size, ...
- Teacher credentials also fall within the resources, or "input," approach:
  - Undergraduate education and test scores
  - Graduate education and experience
  - Certification
- Unfortunately, the only one related to teacher effectiveness is experience
- Therefore it is important to consider outcomes and instructional practice as alternatives

Formal Teacher Evaluations
- Do these make up for the weaknesses of credentials?
- Evaluations do not focus on the technical core of teaching, i.e., they ignore instructional practice
- 90% of teachers receive the highest rating
- Principals often do not have the training or the time to be instructional leaders
- Partly because the low stakes of evaluations give little reason to take them seriously
- I almost never hear teachers or administrators say that the formal evaluation works well

Teacher Effectiveness Varies
- New research suggests that teacher effectiveness varies a great deal, even within individual schools
- Some even argue that we could eliminate the achievement gap simply by reassigning the most effective teachers to minority children
- Those differences are exaggerated, but the larger conclusion about variation is not really in dispute
- Also consistent with the evidence on credentials: if credentials worked, we would see less variation
- Yet we measure teacher effectiveness poorly, and accountability focuses on whole schools

A Failure of Test-Based Accountability: The Snapshot Problem
- Snapshot = any measure of student outcomes at a single point in time, regardless of test reporting method (% proficient, scale scores, etc.)
- Until now, all accountability has been based on snapshots
- The problem: students enter the classroom at very different starting points, because of factors outside the control of the school (the "starting gate" inequality)
- Why is this a problem?

Cardinal Rule of Accountability
- Rule: Hold people accountable for what they can control
- Part 1: "Hold people accountable..." meaning that accountability is important
- Part 2: "...for what they can control," meaning that the details matter
- Accountability systems have failed to follow the Cardinal Rule because the snapshot fails to account for what students bring to the classroom

Consequences
- Driving teachers out of low-snapshot schools
- Pushing low-snapshot students out the door
- Complacency in high-snapshot schools
Value-added measures can help address the snapshot problem and reduce these consequences.

Questions About the Rationale for Value-Added

What are Value-Added Measures (VAM)?

Basic VAM
- If the problem is accounting for what students bring with them to the classroom, then measure what they bring
- Annual student testing allows researchers to subtract prior scores from current ones: growth
- Growth can be calculated for different test score reporting methods; scale scores and NCEs work best
- Ideal: growth of individual students based on scale scores
- This is the paradigm shift
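
To make the growth calculation concrete, here is a minimal sketch in Python (not from the presentation; the teacher names and scores are hypothetical) that computes each student's growth as current score minus prior score and averages it by teacher:

```python
# Minimal sketch of basic value-added as average student growth.
# Hypothetical data: each record is one student with a prior-year
# and current-year scale score and an assigned teacher.
from collections import defaultdict

students = [
    {"teacher": "Bloom", "prior": 310, "current": 360},
    {"teacher": "Bloom", "prior": 295, "current": 350},
    {"teacher": "Smith", "prior": 420, "current": 440},
    {"teacher": "Smith", "prior": 415, "current": 430},
]

growth_by_teacher = defaultdict(list)
for s in students:
    # Growth = current score minus prior score: judge the change,
    # not the snapshot level (the "paradigm shift").
    growth_by_teacher[s["teacher"]].append(s["current"] - s["prior"])

for teacher, growths in growth_by_teacher.items():
    print(teacher, sum(growths) / len(growths))
# Bloom averages +52.5 despite low snapshots; Smith averages +17.5
# despite high snapshots, mirroring Illustration #2 below.
```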

Illustration of Basic Approach: Two Teachers with the Same Value-Added
[Chart: achievement from the start to the end of the school year. Ms. Erickson's class starts high (high snapshot) and Mr. Hacker's starts low (low snapshot); both classes gain the same amount, and the initial gap is labeled the starting-gate inequality.]

Illustration #2: Two Teachers with Different Value-Added
[Chart: achievement from the start to the end of the school year. Ms. Smith has low value-added but a high snapshot; Ms. Bloom has high value-added but a low snapshot.]

Limits of Basic VAM, and Advanced VAM
- Unequal school resources: prior achievement may not be enough to account for student differences
- Possible solution: compare similar schools by putting them into "buckets," so comparisons are apples to apples (within buckets)
- Teachers whose students make greater-than-predicted growth have high value-added
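
A minimal sketch of the within-bucket comparison, assuming hypothetical bucket labels and growth numbers (the slides do not specify how buckets are formed):

```python
# Minimal sketch of an "apples to apples" comparison: schools are
# grouped into buckets of similar schools, and each school's growth
# is compared with the average growth of its own bucket.
from collections import defaultdict
from statistics import mean

# Hypothetical schools: (name, bucket, average student growth)
schools = [
    ("North", "high-poverty", 28.0),
    ("South", "high-poverty", 22.0),
    ("East",  "low-poverty",  35.0),
    ("West",  "low-poverty",  41.0),
]

buckets = defaultdict(list)
for name, bucket, growth in schools:
    buckets[bucket].append(growth)

for name, bucket, growth in schools:
    # Value-added here is growth relative to similar schools only.
    print(name, round(growth - mean(buckets[bucket]), 1))
# North comes out +3.0 within its bucket even though its raw growth
# (28.0) is below that of both low-poverty schools.
```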

Illustration of Advanced VAM: A Simple Comparison
[Chart: achievement over time (grades 3-6). A school whose growth line rises faster than the district (or similar-schools) growth line has high value-added.]

Illustration of Advanced VAM: Prediction Approach
[Chart: achievement over time (grades 3-6). The individual school's growth line is compared with a predicted-growth line; growth above the prediction is high value-added.]

Illustration of Advanced VAM: Prediction Approach with Low Value-Added
[Chart: achievement over time (grades 3-6). Growth above the predicted line is high value-added; growth below it is low value-added.]

Illustration of Advanced VAM: Prediction Approach with Controls
[Chart: achievement over time (grades 3-6). Separate predicted-growth lines are drawn for schools with small and large class sizes, so each school is compared with a prediction that reflects its resources.]

How Exactly Does It Work?
- With each control variable included, VAMs account for the contribution of that factor to student achievement on average, across all schools
- Based on these measured contributions, VAMs assign "bonus points" to schools with fewer school resources (and more disadvantaged students, if demographics are included)
- Example: if having 1 fewer student in class increases test scores by 2 points, then a school with 5 more students per class than the average school gets 10 bonus points
- Each control variable added helps make the schools in each bucket more and more similar in terms of what they can control
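
One common way to implement this, consistent with the slide's class-size example though not spelled out there, is to regress growth on the control variables and read value-added off the residuals; the regression slope plays the role of the "bonus points per unit." A minimal sketch with hypothetical numbers:

```python
# Minimal sketch: regress school growth on class size, then read each
# school's value-added as its residual (actual minus predicted growth).
# Numbers are hypothetical; the fitted slope (about -2 points per extra
# student here) plays the role of the slide's "2 points" contribution.

# (class size, average growth) for hypothetical schools
data = [(20, 34.0), (22, 31.0), (25, 24.0), (28, 19.0), (30, 12.0)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
intercept = mean_y - slope * mean_x

for x, y in data:
    predicted = intercept + slope * x  # growth expected given class size
    # Residual = value-added: schools with larger classes get an
    # implicit "bonus" because their predicted growth is lower.
    print(f"class size {x}: value-added {y - predicted:+.1f}")
```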

Controversy of Student Demographics
- Accounting for student demographics can be interpreted as lowering expectations for disadvantaged students
- In one sense, this is true: schools with fewer resources and more disadvantaged students can achieve the same ratings as other schools despite lower actual achievement gains
- In another sense, this is false: value-added does not provide schools with any incentive to give greater effort to disadvantaged students
- We can apply weights that give as much or as little weight to disadvantaged students as we wish

Value-Added Measures Are Relative
- VA allows us to make comparisons among schools and teachers (it is relative), not draw absolute conclusions about performance
- On the one hand, this means that some teachers and schools will have low value-added no matter what they do
- On the other hand, we would never want to say that a teacher or school that reaches a particular standard is "good enough"
- Relative measures facilitate continuous improvement

Questions About How Value-Added Measures are Created

Possible Errors in Value-Added Measures

Two Basic Types of Errors
- Systematic error: more likely to occur with a particular school or teacher
  - Snapshots are a case in point: they systematically disadvantage low-snapshot schools
- Random error: equally likely to arise for everyone (example: a coin toss)
  - Two sources: measurement error (from the student test scores) and sampling error (more students, less sampling error)
- Random error is worse with growth measures

Illustrating Random Error in Growth Measures
[Chart: achievement over time. A student scores 1100 in 3rd grade and 1400 in 4th grade, so measured growth is +300. But because each score carries measurement error of roughly +/- 100 points in the illustration, the errors can stack: the true growth could be as high as +500 or as low as +100.]
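
The point generalizes beyond the illustration. This is standard measurement theory rather than material from the slides: if the two scores' measurement errors are independent, the error variance of the growth score is the sum of the two scores' error variances, roughly double that of a single snapshot:

```latex
% Each observed score is true achievement plus measurement error,
% with errors independent across years:
%   y_3 = \theta_3 + e_3, \qquad y_4 = \theta_4 + e_4 .
% The growth score inherits BOTH errors:
g \;=\; y_4 - y_3 \;=\; (\theta_4 - \theta_3) + (e_4 - e_3),
\qquad
\operatorname{Var}(e_4 - e_3) \;=\; \operatorname{Var}(e_3) + \operatorname{Var}(e_4).
```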

More on Errors
- Types of random errors:
  - Type I error = in this case, the probability of concluding two teachers perform differently when they are in fact the same ("statistical significance")
  - Type II error = the probability of concluding two teachers perform the same when they are in fact different
- Random and systematic errors are both important for deciding how to use performance measures
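
To see how these errors play out with realistic class sizes, here is a minimal sketch (hypothetical growth scores; the two-standard-error rule below is a rough stand-in for a formal significance test):

```python
# Minimal sketch of the Type I / Type II trade-off: decide whether two
# teachers' value-added differs by comparing the gap in mean growth
# with its standard error. Numbers are hypothetical.
from statistics import mean, stdev
from math import sqrt

growth_a = [12, 35, 20, 28, 15, 31, 22, 26]  # Teacher A's students
growth_b = [18, 40, 25, 33, 21, 38, 29, 34]  # Teacher B's students

gap = mean(growth_b) - mean(growth_a)
se_gap = sqrt(stdev(growth_a) ** 2 / len(growth_a)
              + stdev(growth_b) ** 2 / len(growth_b))

# A rough two-standard-error rule keeps Type I error near 5% when the
# teachers are truly the same. But with few students per teacher,
# se_gap is large, so real differences are often missed (Type II).
print(f"gap = {gap:.1f}, standard error = {se_gap:.1f}")
print("statistically distinguishable" if abs(gap) > 2 * se_gap
      else "cannot distinguish the two teachers")
```

With these numbers the gap (about 6 points) is real but smaller than two standard errors, so the test cannot distinguish the teachers: exactly the Type II error described above.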

Statistical Errors and Decision Errors
Random errors map onto policy decision errors:
- Type I error (conclude two are different when they are really the same) → decision error one: e.g., give an award to someone who really isn't high-performing
- Type II error (conclude two are the same when they are really different) → decision error two: e.g., leave someone on the job who is performing poorly
- Systematic errors feed into the same decision errors

We made too many wrong mistakes -- Yogi Berra

Research on Strengths and Weaknesses of VAM

The Good News
- Research on VAM is in its infancy, but...
- Again, differences between the lowest and highest value-added teachers seem large
- VAM measures have been partly validated by a random-assignment experiment (here in LA)
- VAM measures of teacher effectiveness are positively correlated with principals' subjective assessments of teachers

The Bad News
- VA is no better than the tests: garbage in, garbage out (much effort right now is aimed at improving the quality of student assessments)
- VA measures are imprecise: it is hard to say that one teacher is clearly better than another based on VAM
- As a result, teacher measures are unstable:
  - They vary across tests (of the same subject)
  - They are sensitive to specific statistical assumptions
- VA may not totally address the tracking problem

The Limited Applicability of VAM
- One of the main limitations of VAM is that, in most states, it can only be applied easily in grades 4-8, math and reading
- Excluded: teachers of other subjects, coaches, and specialists; teachers in grades K-3 and 9-12; new teachers
- On the other hand, it wouldn't make sense for teacher evaluations to be the same across all grades and subjects

Questions About the Strengths and Weaknesses of Value-Added Measures

Putting the Evidence in Perspective
- Researchers have strict standards for drawing conclusions based on statistics (about teacher performance or anything else); see the AERA/APA/NCME standards
- As decision-makers, you do not have this luxury: you cannot wait around for ideal solutions, or accept large numbers of ineffective teachers remaining in classrooms
- All measures have their advantages and disadvantages, and you have to compare them

The Double Standard
- Critics of VAM don't apply the same standard to credentials that they do to value-added
- Example: do credentials converge with results from other ratings of quality, such as classroom observations, parent surveys, etc.? Answer: no way
- No performance measure could possibly meet the AERA/APA/NCME standards

When I hear somebody sigh, 'Life is hard,' I am always tempted to ask, 'Compared to what?' -- Sydney J. Harris (journalist)

Understanding Value-Added: The 3 Key Distinctions

Teacher vs. School Value-Added
- Teacher value-added is arguably more problematic than school value-added:
  - It is more subject to student tracking
  - There are fewer students per teacher
  - Teachers aren't accustomed to substantive evaluation
- There is a trade-off between free-riding and accuracy
- A middle option: team value-added
  - Elementary schools: grade-level teams
  - Middle and high schools: subject-matter teams

Formative vs. Summative
- VAM is inherently summative: it does not provide much guidance on how to improve
- No measure can do both well; formative and summative measures are complementary
- Formative measures alone provide a path to improvement but perhaps not an incentive (the credentialing problem)
- Summative measures provide an incentive but no path

Low- vs. High-Stakes
- There aren't any "no stakes" uses
- Lowest stakes: school-level VA with school bonuses
- Medium stakes: report teacher VA to the school principal; performance pay
- Highest stakes: make VA measures publicly available; tenure and dismissal

Recommendations: Using Value-Added to Improve Teaching and Learning

Recommendations for Using VAM
#1: Use value-added to measure school performance and hold schools accountable
#2: Experiment with and carefully evaluate policies that use value-added to measure the performance of individual teachers
#3: In creating performance measures, combine value-added with other measures more closely related to actual practice
#4: Experiment with and carefully evaluate policies that use value-added to measure the performance of teacher teams

Recommendations: Part II
#5a: Consider extending value-added to other grades, subjects, and student outcomes...
#5b: ...but don't let the tail wag the dog
#6: Avoid the "air bag problem": don't drive value-added measures too fast

Recommendations on Creating and Reporting VA Measures: Part I
#1: Use student tests that reflect rich content and are standardized, scaled, and criterion-referenced
#2: Create data systems that link student outcomes over time and to teachers and schools
#3: Include all students, including special education students, English Language Learners, and students with some missing data
#4: Make adjustments to align the timing of the test with the timing of schooling activities

Recommendations on Creating and Reporting VA Measures: Part II
#5: Average value-added measures over 2 years
#6: Create value-added measures based on comparisons among teachers and schools that facilitate cooperation and collaboration
#7: Create value-added measures that compare teachers within the same grades and subjects
#8: Account for factors that are outside the control of those being evaluated
#9: Adjust for sampling error
#10: Report confidence intervals
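
Recommendations #5, #9, and #10 fit together in one small calculation. A minimal sketch with hypothetical numbers, using a rough two-standard-error interval (the slides do not prescribe a specific method for the sampling-error adjustment):

```python
# Minimal sketch of recommendations #5, #9, and #10: average a
# teacher's value-added over two years, quantify sampling error,
# and report a confidence interval rather than a bare point estimate.
from statistics import mean, stdev
from math import sqrt

# One hypothetical value-added estimate per student, for two years.
va_year1 = [5, -2, 8, 3, 0, 6, -1, 4]
va_year2 = [2, 7, -3, 5, 9, 1, 4, 6]
pooled = va_year1 + va_year2            # #5: average over 2 years

estimate = mean(pooled)
se = stdev(pooled) / sqrt(len(pooled))  # #9: sampling error shrinks with n

# #10: report a rough 95% confidence interval, not just the point value.
low, high = estimate - 2 * se, estimate + 2 * se
print(f"value-added = {estimate:.1f} (95% CI roughly {low:.1f} to {high:.1f})")
if low <= 0 <= high:
    print("Cannot distinguish this teacher from the average.")
```

Pooling two years doubles the number of students behind the estimate, which cuts the standard error and narrows the reported interval; that is the statistical payoff of recommendation #5.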

An Additional Recommendation
- Use value-added to evaluate school, district, and state programs and practices
- The evidence on teacher credentials (above) is a good example
- The value-added approach solves the same problem in program evaluation as it does in measuring educator performance: avoiding systematic errors

How Are Others Using VA?
- Most districts are following these recommendations:
  - Mixing value-added with classroom observations
  - Using value-added as a partial basis for merit pay in dozens of districts (the federal TIF program)
  - Revamping formal evaluation and tenure decisions
- For many, the lack of a good data system is the first barrier
- Some problems:
  - Moving too fast (the "air bag" problem)
  - Lack of professional development about VA measures
  - Dueling evaluation systems

All models are false but some models are useful. -- George E.P. Box

Conclusion: Moving Forward to Ensure Teacher Effectiveness in LAUSD
- We can do better than the credentialing and checklist evaluation system
- In deciding how to use VAM, we should:
  (1) Ask ourselves: is this system going to give high ratings to the types of teachers and schools I would want my children to attend?
  (2) Compare VAM to the alternatives
- We need a comprehensive system of teacher effectiveness, and performance measures in some form represent one important element

Papers and References
- Policy brief from PACE (forthcoming)
- Forthcoming book on value-added from Harvard Education Press (February)
- My web site: http://www.education.wisc.edu/eps/faculty/harris.asp
- Web site focused on teacher quality research: http://www.teacherqualityresearch.org
- Ed Week commentary (June 2008)
- National Conference on Value-Added