On the instability of teacher effectiveness measures, by Morgan Polikoff
One of the most important policy innovations of the last few years has been the adoption and implementation of new multiple-measure teacher evaluation systems. These systems, encouraged by the Obama administration’s Race to the Top and No Child Left Behind Waiver programs, use measures of student learning alongside other measures of teachers’ classroom performance to make formative and summative judgments about individual teachers.
By far the most controversial component of these systems has been the student achievement measure. Value-added models (VAMs), which use students’ prior achievement history and, sometimes, demographic characteristics to estimate teachers’ impact on student achievement, have been criticized along a number of dimensions. The most fundamental objection has been that VAMs are unstable from year to year; this instability, it is argued, all but invalidates their potential use for high-stakes evaluation.
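For readers unfamiliar with the mechanics, a typical VAM can be sketched roughly as follows; this is a generic illustration of the class of models, not the specification used in the MET study or in any particular state system:

$$A_{it} = \lambda A_{i,t-1} + X_{it}'\beta + \tau_{j(i,t)} + \varepsilon_{it}$$

where $A_{it}$ is student $i$'s achievement in year $t$, $A_{i,t-1}$ is the student's prior achievement, $X_{it}$ is an optional vector of demographic controls, $\tau_{j(i,t)}$ is the estimated effect of the teacher who taught student $i$ in year $t$ (the teacher's "value added"), and $\varepsilon_{it}$ is an error term.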
While the (in)stability of VAMs is well known by now, thanks to decades of research, much less is known about the technical properties of the classroom observation and student survey portions of new evaluations. Presumably, should the stability results look roughly similar (year-to-year correlations in the .2 to .5 range), the same validity concerns should apply. My paper in the American Journal of Education, The Stability of Observational and Student Survey Measures of Teaching Effectiveness, uses data from the Bill and Melinda Gates Foundation’s Measures of Effective Teaching study to investigate this issue, looking at the year-to-year stability of several well-known and widely used observational and student survey measures (the Framework for Teaching, the Classroom Assessment Scoring System, the Protocol for Language Arts Teaching Observations, the Mathematical Quality of Instruction instrument, and the Tripod student survey).
The results show that the year-to-year stability at the total score level for these observational and student survey measures is only slightly better than that of VAMs – typically in the .45 to .55 range. When subscales are examined—important because subscale scores are likely more useful to teachers from a formative standpoint—the results are weaker still.
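As a point of reference for how such stability coefficients are computed, here is a minimal sketch in Python with hypothetical teacher-level data and made-up column names; the study's own estimates come from a more involved procedure, so this simply illustrates the year-to-year correlation at the heart of the statistic.

```python
import pandas as pd

# Hypothetical data: one row per teacher, with total scores on the same
# instrument (column names are made up for illustration) in two years.
scores = pd.DataFrame({
    "teacher_id": [1, 2, 3, 4, 5, 6],
    "obs_year1": [2.8, 3.4, 2.1, 3.9, 2.5, 3.1],
    "obs_year2": [3.0, 3.1, 2.6, 3.6, 2.2, 3.4],
})

# Year-to-year stability as the Pearson correlation between the two
# years' total scores for the same teachers.
stability = scores["obs_year1"].corr(scores["obs_year2"])
print(f"Year-to-year stability (total score): {stability:.2f}")
```

A correlation near .5 implies that only about a quarter of the variance in one year's scores is predictable from the prior year's scores, which is why coefficients in this range are read as substantial instability.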
Next, I sought to investigate the extent to which instability varied based on the characteristics of teachers or classrooms. I did find limited evidence that instability on certain instruments might be more of an issue for teachers in elementary schools than in middle schools. However, I found no evidence that instability was explained by year-to-year variations in the characteristics of students.
Finally, to help make sense of these findings, I presented the year-to-year reclassification rates for each of the studied instruments. Reclassification was studied using two approaches—a norm-referenced approach, where teachers were sorted into quintiles and followed into the next year, and a criterion-referenced approach, where teachers were classified as above or below some performance cut in year one and followed into year two. This reclassification analysis was illuminating, revealing that reclassification was more of a problem when using the criterion-referenced approach (as is common in new state systems), but that high- and low-performing teachers in one year using a norm-referenced approach were relatively unlikely to be rated the opposite in another year.
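To make the two reclassification approaches concrete, here is a minimal sketch, again in Python, using simulated scores correlated at roughly the .5 level reported above and an arbitrary performance cut; it mirrors the logic of the analysis rather than reproducing the paper's exact procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated total scores for 500 teachers in two years, correlated at
# roughly .5 to mimic the stability levels reported above.
year1 = rng.normal(size=500)
year2 = 0.5 * year1 + np.sqrt(1 - 0.5**2) * rng.normal(size=500)

# Norm-referenced approach: sort teachers into quintiles in each year
# and tabulate how they move between quintiles across years.
q1 = pd.qcut(year1, 5, labels=range(1, 6))
q2 = pd.qcut(year2, 5, labels=range(1, 6))
transition = pd.crosstab(q1, q2, rownames=["year1_quintile"],
                         colnames=["year2_quintile"], normalize="index")
print(transition.round(2))

# Criterion-referenced approach: classify teachers as above or below a
# fixed performance cut (an arbitrary cut at 0 here) in each year and
# count how many switch sides of the cut.
cut = 0.0
reclassified = ((year1 >= cut) != (year2 >= cut)).mean()
print(f"Share reclassified across the cut: {reclassified:.2f}")
```

At this level of stability, teachers near the cut score flip sides easily, while teachers in the extreme quintiles rarely jump to the opposite extreme, consistent with the pattern described above.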
Overall, whether these results are concerning or heartening probably depends on where you stand with regard to the evidence we already have. For those who view the instability of VAMs as a fatal flaw limiting their utility for high- or low-stakes decisions about teachers, these results suggest that the same concerns may apply to observations and student survey measures. That is, unless one imagines that the cutoff for “too unstable to be useful” lies between .5 and .55, the instability concerns some have raised about VAMs likely also apply to these observational and student survey measures.
Alternatively, for those who argue that VAMs provide useful evidence about teacher effectiveness and that the year-to-year stability concerns are overblown, these results suggest that observational and student survey components also appear to be capturing some stable element of teacher effectiveness. This agrees with findings presented in the main Measures of Effective Teaching reports, which showed that the stability of a composite of measures was greater than the stability of the individual component measures.
While these results provide some of the first large-scale evidence on the stability of the non-VAM components of new teacher evaluation systems, they are limited in several ways that demand further investigation. The most important limitation is that the data used here come from a research study, not from an actual system implemented in a district or state. As data are collected from new evaluation systems, it is imperative that districts and states engage in this kind of analysis in order to understand the properties of the new systems. This is true not only for stability, but also for issues of bias (another claim commonly leveled against VAMs that may well apply to observational and student survey measures). Simply put, more evidence is needed from real-life implementation of these systems.
The research also points to the need for consensus about whether the instability of measures of teaching performance is a problem, and about what level of stability is needed to make either high- or low-stakes decisions about teachers. Perhaps none of the measures will have the technical properties desired by opponents of new evaluation systems. Or perhaps all of the measures provide useful information and can be used thoughtfully to help improve teaching and learning in U.S. schools.
Morgan S. Polikoff is an assistant professor of education at the University of Southern California Rossier School of Education. He researches the design, implementation, and effects of standards, assessment, and accountability policies.
Critics consider value-added measures (VAMs) inaccurate, unstable, and unreliable. Research shows, however, that differences in teacher effectiveness as measured by value-added are associated with very large economic effects on students. Teachers can influence how students learn in the classroom, but they cannot control students’ family backgrounds, motivation, or interest in learning. VAM results therefore may not isolate how much of a student’s progress is attributable to the teacher: students from higher-income families, for example, are more likely to have well-educated parents who can help them with homework at home. For related reasons, this measure may not be the best method for gauging a teacher’s ability; learning is not a linear activity, every child’s cognitive development differs, and how students learn depends in large part on their grade level.
Beyond these limitations of value-added measures, which the author also points out, similar instability exists in other approaches to evaluating teaching effectiveness. For classroom observational measures, instability comes into play when teachers are informed in advance that administrators will visit the classroom and observe their instruction. Teachers may plan a lesson beforehand that involves more student interaction than they usually have in class, or more engaging content that does not necessarily connect to their syllabus. Some would respond that multiple classroom observations should therefore take place at random times, so that teachers are observed doing what they normally do rather than staging a well-planned “showcase” class. In fact, instability would arise in this approach as well. For instance, in classrooms of students from low-SES backgrounds, student-teacher interaction may take place less often than in classrooms in high-income districts. Classrooms in lower-income school districts may include less high-level questioning, which requires teachers to facilitate discussion or conduct small-group learning; because students have less prior knowledge of the content they are learning, teachers tend to use more low-level strategies such as modeling and demonstrating.
According to the results of the study in this article, the year-to-year stability at the total score level for these observational and student survey measures is only slightly better than that of VAMs, so instability remains a limitation for these evaluation approaches as well. Both research and practice are needed to develop a teacher evaluation model that is stable, accurate, and objective. In my view, decisions about teachers should be based on diversified measures of teaching effectiveness. Evaluations from principals and administrators should be valued, along with feedback from students, parents, and peer teachers. Teaching-effectiveness panels held by schools should include voices from all of these actors, and decisions should emerge from discussion among at least three representatives from each group.