Read the whole thing @Shanker Blog

Much of the criticism of value-added (VA) focuses on systematic bias, such as that stemming from non-random classroom assignment (also here). But the truth is that most of the imprecision in value-added estimates stems from random error. Months ago, I lamented the fact that most states and districts incorporating value-added estimates into their teacher evaluations were not making any effort to account for this error. Everyone knows that there is a great deal of imprecision in value-added ratings, but few policymakers seem to realize that there are relatively easy ways to mitigate the problem.
This is the height of foolishness. Policy is details. The manner in which one uses value-added estimates is just as important as – perhaps even more important than – the properties of the models themselves. By ignoring error when incorporating these estimates into evaluation systems, policymakers virtually guarantee that many teachers will receive incorrect ratings. Let me explain.
Each teacher’s value-added estimate has an error margin (e.g., plus or minus X points). Just like a political poll, this error margin tells us the range within which that teacher’s “real” effect (which we cannot know for certain) is likely to fall. But unlike political polls, which rely on large random samples to get accurate estimates, value-added error margins tend to be gigantic. One concrete example is from New York City, where the average margin of error was plus or minus 30 percentile points. This means that a New York City teacher with a rating at the 60th percentile might “actually” be anywhere between the 30th and 90th percentiles. We cannot even say with confidence whether this teacher is above or below average.
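The interval logic here is simple enough to sketch. The snippet below is purely illustrative (the function names and the cap at the 0th/100th percentile are my own assumptions, not any district's actual method); the numbers come from the NYC example above.

```python
# Illustrative sketch of the margin-of-error reasoning above.
# Numbers (60th percentile, +/- 30 points) are from the NYC example;
# the functions themselves are hypothetical, not an official method.

def rating_interval(point_estimate, margin_of_error):
    """Plausible range for a teacher's 'real' percentile rating,
    clipped to the 0-100 percentile scale."""
    low = max(0, point_estimate - margin_of_error)
    high = min(100, point_estimate + margin_of_error)
    return low, high

def distinguishable_from_average(point_estimate, margin_of_error, average=50):
    """True only if the entire interval lies on one side of the average."""
    low, high = rating_interval(point_estimate, margin_of_error)
    return average < low or average > high

print(rating_interval(60, 30))                 # (30, 90)
print(distinguishable_from_average(60, 30))    # False: could be above or below average
```

With a ±30-point margin, only teachers rated above the 80th or below the 20th percentile can be distinguished from the average at all.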
Now, here’s the problem: In virtually every new evaluation system that incorporates a value-added model, the teachers whose scores are not significantly different from the average are being treated as if they are. For example, some new systems sort teachers by their value-added scores, and place them into categories – e.g., the top 25 percent are “highly effective,” the next 25 percent are “effective,” the next 25 percent are “needs improvement,” and the bottom 25 percent are “ineffective.”
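To get a feel for what quartile-sorting does with noisy scores, here is a hypothetical simulation (mine, not from the post): teachers with evenly spread "true" percentile ranks get observed scores with random error roughly consistent with the ±30-point NYC margin (a 95% margin of ~30 points implies a noise SD of roughly 15 points), and are then binned into quartiles by their observed rank.

```python
# Hypothetical Monte Carlo sketch: how often does random error alone
# push a teacher into the wrong quartile? The noise SD of 15 is a rough
# back-of-the-envelope translation of a +/- 30-point margin of error.
import random

random.seed(0)
N = 10_000
NOISE_SD = 15

# "True" percentile ranks, uniformly spread from 0 to 100
true_scores = [100 * (i + 0.5) / N for i in range(N)]
# Observed score = true score + random error
observed = [t + random.gauss(0, NOISE_SD) for t in true_scores]

def quartile(rank, n):
    """0 = bottom 25%, ..., 3 = top 25%, by rank order."""
    return min(3, 4 * rank // n)

# Rank teachers by observed score, as the evaluation systems do,
# and compare each teacher's observed quartile to their true quartile.
order = sorted(range(N), key=lambda i: observed[i])
misclassified = sum(
    1 for rank, i in enumerate(order)
    if quartile(rank, N) != quartile(i, N)  # true_scores are already in rank order
)

print(f"{misclassified / N:.0%} of teachers placed in the wrong quartile")
```

Under these assumptions, a substantial share of teachers land in the wrong category purely by chance, which is exactly the problem with treating statistically indistinguishable scores as meaningfully different.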