In the first three posts of this series (see Diane Ravitch’s summary of them here), I argue that Matt DiCarlo’s retrospective analysis of research on market-driven reform, “The Year in Research on Market-Based Education Reforms: 2013,” gives short shrift to practical realities. DiCarlo notes, for instance, that the Gates Foundation’s Measures of Effective Teaching (MET) study was “designed to provide guidance as to how states should design new teacher evaluation systems,” but that “unfortunately, many states and districts have already finalized” their systems. He then chides the MET’s final press release for its “bold statement that the project ‘demonstrated that it is possible to identify great teaching.’”
I agree with DiCarlo’s implicit protest against mandating high-stakes evaluations before research had been conducted on how those systems should be designed. On his second criticism, however, I part ways: I see nothing wrong with that phrase in the press release. It is not hard to identify great and good teaching. Measuring great and good teaching is a different matter.
Conversely, identifying bad teaching is usually easy. Firing bad teachers isn’t that hard, either. The big problem is that so many school administrators carry such an overwhelming workload that they don’t tackle the unpleasant task of dismissing those teachers. Moreover, the problem is worse in the inner city, where principals doubt that qualified replacements can be found.
The problem is that the Gates Foundation and the “Billionaires Boys Club” helped coerce states into adopting systems in which they had to try to measure effective and ineffective instruction and attach stakes to those flawed metrics. As the MET released more evidence, it should have become obvious that entire systems could not measure the effectiveness of all teachers.
DiCarlo rightfully praises the thoroughness of the MET’s technical survey by Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J. R. Lockwood. This paper is professional, but unlike the work by McCaffrey cited in my second post, it starts with the MET’s assumptions and merely explicates its methods. One of those assumptions is that districts will employ value-added in a manner that is professionally and morally impeccable. As they write, “we assume that states and districts are interested in a target criterion, a univariate quantity of interest that if known would be the preferred value for making the decisions about teachers that the composite performance measure will be used to support.”
Come on, how many state and district leaders know what that sentence means, much less grasp the problems with value-added? Many, or most, are undoubtedly just trying to stay out of trouble until the value-added tidal wave recedes.
Mihaly, McCaffrey, Staiger, and Lockwood also assume elementary school classes of twenty students and middle school teachers who have twenty students in each of four classroom sections. They make this implausible assumption because “Twenty is the median number of students per section used to estimate student achievement VA on the classroom rosters collected by the MET project, and four sections represent about the average number of sections we have found in data.”
I wonder how many inner city middle school teachers the Gates Foundation could find who only teach eighty students.
This assumption presumably explains their finding that elementary scores have low reliability in comparison to middle school scores, which contradicts the findings of other value-added studies discussed in the previous posts. Presumably, they mean that the statement applies to the middle school sample linked to the MET methodology, as opposed to the real world.
And that leads to the biggest problem with the MET sample. Even as value-added was being imposed on high-poverty schools across the nation, the MET used a sample in which only 56% of students were low income, only 8% were on special education IEPs, and only 13% were English Language Learners to determine whether value-added could be valid for high-challenge schools.
The big finding by Mihaly, McCaffrey, Staiger, and Lockwood was that each of the MET’s evaluation measures “captures some distinct unique dimension of effective teaching.” They conclude:
Simply put, indicators of state value added are the best predictors of a teacher’s impact on state test scores, classroom observation scores are the best predictors of a teacher’s classroom practice, and student surveys are the best predictors of student perceptions of the teacher.
“In addition,” they report, “our analysis was performed in a low-stakes environment (although some of the districts participating in our study were transitioning to high stakes for value added measures). Results may differ in high-stakes conditions where teachers would have more incentives to distort the individual indicators.”
Of course, there is a huge difference between testing multiple measures for capturing the great variety of effective teaching and inventing a statistical model that captures each dimension in a single high-stakes score.
“The Reliability of Classroom Observations by School Personnel,” by Tom Kane and Andrew Ho, also found that most teachers’ value-added performance was clustered in the middle, that multiple evaluators were necessary to ensure fairness, and that administrators ranked teachers higher than peer reviewers did.
Regarding the last point, didn’t they know that teachers have a history of being tougher evaluators of each other than administrators are? How could they have expected the MET experiment to lead to systems as rigorous and cost-effective as the PAR peer review evaluations conducted by teachers’ unions?
DiCarlo cites a second study, “Have We Identified Effective Teachers?” by Kane, McCaffrey, Trey Miller, and Douglas O. Staiger, which probes the hugely important question of whether “sorting” can be addressed by value-added. It admits up front that, as a practical matter, students and teachers could not be randomly assigned to a different school site. “Our study does not allow us to investigate the validity of the measures of effectiveness for gauging differences across schools,” they report. “The process of student sorting across schools could be different than sorting between classrooms in the same school,” but “… Unfortunately, our evidence does not inform the between-school comparisons in any way.”
Kane et al. seem to have once assumed that a different type of sorting would not occur and that all states and districts would be intellectually honest in addressing the distortions caused by various types of peer pressure in diverse parts of our segregated society. They implicitly criticize Florida’s politicization of its statistical models. Far from implementing value-added in the most scholarly manner, that state “has gone so far as to create regulations forbidding the use of poverty status and race in value-added models for teacher effects.”
Who wouldn’t be “shocked, shocked” that politics could be found in the Florida government?
And that leads to the MET’s ultimate finding:
Overall, our findings suggest that existing measures of teacher effectiveness provide important and useful information on the causal effects that teachers have on their students’ outcomes. No information is perfect, but better information should lead to better personnel decisions and better feedback to teachers.
As I will explain in subsequent posts: would any rational person with any real-world knowledge of schools have gone down the value-added road to improve schools, as opposed to dominating educators, knowing that the MET would reach such an underwhelming conclusion?