Value-Added Teacher Evaluation Criteria Post-Race to the Top by Joe Elefante

Race to the Top (RTTT) was a competitive education grant program introduced by President Barack Obama’s administration in 2009. RTTT awarded large grants to states for satisfying certain criteria (U.S. Department of Education, 2009b). One such criterion was that states applying for the grant must “(d)esign and implement rigorous, transparent, and fair evaluation systems for teachers and principals that… take into account data on student growth… as a significant factor” (U.S. Department of Education, 2009b). Student growth, according to RTTT, was at least partly based on “a student’s score on the State’s assessments under the ESEA (Elementary and Secondary Education Act)” (U.S. Department of Education, 2009b). At $4.35 billion, RTTT was the largest competitive education grant program in United States history (U.S. Department of Education, 2009a).

What followed was an explosion of value-added models (VAMs) in states’ teacher evaluation systems. VAMs use statistical methods to estimate individual teachers’ effects on their students’ growth on state-mandated assessments (Opper, 2019). From 2009 to 2014, the number of states that included VAMs in teacher evaluations grew from only 15 to 42 (Aldeman, 2017).
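In their simplest form, such models regress students’ current test scores on their prior scores plus indicators for the teachers they had, and each teacher coefficient is read as that teacher’s contribution to student growth. The sketch below is a minimal illustration of that idea only, not any state’s operational model; the tiny dataset and its column names (score, prior_score, teacher) are invented purely for demonstration.

```python
# A minimal, illustrative value-added sketch; not any state's operational model.
# The tiny dataset below is invented purely for demonstration.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "score":       [0.62, 0.41, 0.90, 0.35, 0.71, 0.55],   # current-year test scores
    "prior_score": [0.50, 0.45, 0.80, 0.40, 0.60, 0.58],   # prior-year test scores
    "teacher":     ["A",  "A",  "B",  "B",  "C",  "C"],    # teacher assignment
})

# Regress current scores on prior scores plus teacher indicators; each teacher
# coefficient is a crude "value-added" estimate relative to the omitted teacher (A).
model = smf.ols("score ~ prior_score + C(teacher)", data=df).fit()
print(model.params.filter(like="teacher"))
```

Operational models typically add student and classroom covariates, multiple years of scores, and statistical adjustments to the raw estimates, which is part of why different models can rank the same teacher differently.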

There is concern, however, that VAMs do not capture the full scope of teachers’ value to their students (Everson et al., 2013). VAMs, for example, do not apply to teachers of “untested” subjects, since testing is most commonly limited to English language arts (ELA) and math. Nor do VAMs account for the fact that students often learn even tested subjects under multiple teachers, particularly English language learners, students identified as gifted, and students with special needs (Hallinger et al., 2014). Studies show that student factors such as ability, health, and attendance, as well as how students are sorted among teachers, also affect teachers’ VAM scores (Dieterle et al., 2015; Hallinger et al., 2014; Hill et al., 2011; Schochet & Chiang, 2013). Further, some scholars have questioned both the statistical validity of VAMs (Parsons et al., 2019) and their consistency across the different models employed in different states (Atteberry & Mangan, 2020; Blazar et al., 2016; Goldhaber et al., 2013; Papay, 2011).

That said, valid VAMs are undoubtedly a useful metric for assessing certain teachers’ impacts on their students. Let us assume that we want to improve student learning in 3rd through 8th grade math and ELA. Students’ scores on state assessments represent, to some extent, an accurate measure of that learning. VAMs, in turn, give us useful information about the impact teachers of those subjects and grades have on their students’ growth on those assessments.

Of course, this information is useful only if it is then used to improve student learning in those subjects and grade levels. Winters and Cowen (2013), for example, found that student achievement suffered in classrooms led by teachers who, based on their VAM scores, would have been dismissed under existing evaluation systems. Valid VAM scores can thus give school leaders important data for identifying, removing, and replacing “ineffective” teachers.

VAMs – A Theory of Action

Kappler Hewitt and Amrein-Beardsley (2016) offer a compelling theory of action for how VAMs are intended to improve student learning. This theory suggests that teacher evaluation systems incorporating VAMs prompt both the “voluntary and involuntary exit” of ineffective teachers and motivate remaining teachers to improve their practice, thereby leading to improved student achievement (Kappler Hewitt & Amrein-Beardsley, 2016). This simple theory relies on several assumptions. One is that VAM scores accurately measure teacher quality. A second is that VAM score data will be used to inform personnel decisions. Yet another is that remaining teachers will in fact improve due to the accountability tied to VAMs (Kappler Hewitt & Amrein-Beardsley, 2016).

Fundamental to this premise is the belief that teachers have a considerable impact on students’ learning (Kappler Hewitt & Amrein-Beardsley, 2016). Wright et al. (1997) found that teacher quality was the most significant factor in improving student achievement, outweighing other factors, including class size. Rivkin et al. (2005) later established that teacher quality was more important than class size, school resources, and teacher education or experience in predicting student achievement growth. Students who have had teachers with higher VAM scores have enjoyed not only higher test scores (Chetty et al., 2014) but also higher lifetime earnings (Chetty et al., 2013). Because teacher employment decisions are well within the power of school leaders, improving teacher quality is increasingly viewed as one of the most practical means of improving student learning (Kappler Hewitt & Amrein-Beardsley, 2016). While the pre-RTTT literature is clear that VAMs are a worthwhile metric of teacher quality, 12 years later it is worth studying whether the massive growth in VAMs post-RTTT has resulted in improved teaching and learning.

Have VAMs Improved Teaching and Learning?

In a study currently in progress, I am examining student achievement data from the Stanford Education Data Archive (SEDA; Center for Education Policy Analysis, n.d.) to see whether test-score outcomes differ between states that implemented VAMs post-RTTT and those that did not. My initial analysis includes four geographically and demographically similar states. Iowa (Iowa Department of Education, 2019) and Wisconsin (Wisconsin Department of Public Instruction, 2019) have not incorporated VAMs into their teacher evaluation systems. Michigan (Michigan Department of Education, 2019) and Minnesota (Minnesota Department of Education, 2013) both implemented value-added teacher evaluation criteria in the 2013-14 academic year.

Using a difference-in-differences analysis, I estimated whether average scores by district, subject, and grade level differ in the five-year period after implementation. The results show a statistically significant negative association between the implementation of value-added criteria in teacher evaluation and test scores in those four states. That negative association persists in both ELA and math and across all grade levels. In a year-by-year comparison, however, the association weakens over time: it peaks in the second year after implementation, then decreases steadily, becoming non-existent in math and much smaller in ELA by year five (see Table 1; a simplified sketch of the underlying specification follows the table).

Table 1.

Effects of VAM Implementation on Districts’ Mean Test Scores

                 Overall        ELA            Math
VAM              -0.127***      -0.170***      -0.088**
                 (0.025)        (0.025)        (0.028)
Observations     15,298         15,226         15,295

Number of Years after Implementation

                 Overall        ELA            Math
Year 1           -0.134***      -0.182***      -0.093***
                 (0.023)        (0.024)        (0.026)
Year 2           -0.252***      -0.310***      -0.195***
                 (0.031)        (0.032)        (0.034)
Year 3           -0.146***      -0.197***      -0.100**
                 (0.032)        (0.033)        (0.035)
Year 4           -0.073*        -0.089*        -0.058
                 (0.035)        (0.036)        (0.039)
Year 5           -0.029         -0.070*        0.006
                 (0.035)        (0.035)        (0.040)
Observations     15,298         15,226         15,295
Includes district and year fixed effects.
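
For readers interested in the mechanics, the snippet below is a rough sketch of the kind of two-way fixed-effects specification summarized in Table 1: a treatment indicator for VAM implementation plus district and year fixed effects, with standard errors clustered by district. The panel it builds is entirely simulated, and the column names (mean_score, vam, district, year) are placeholders rather than SEDA’s actual variable names.

```python
# A rough two-way fixed-effects difference-in-differences sketch. The panel is
# simulated, and the column names are placeholders, not SEDA's actual variables.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Build a hypothetical long-format panel: one row per district-year.
rows = []
for i in range(40):
    district = f"d{i}"
    treated = i < 20                                 # first 20 districts sit in "VAM states"
    for year in range(2009, 2019):
        vam = int(treated and year >= 2014)          # adoption in the 2013-14 year
        mean_score = 0.05 * (year - 2009) - 0.1 * vam + rng.normal(0, 0.2)
        rows.append({"district": district, "year": year,
                     "vam": vam, "mean_score": mean_score})
seda = pd.DataFrame(rows)

# Treatment dummy plus district and year fixed effects, with standard errors
# clustered by district (the fixed effects noted under Table 1).
did = smf.ols("mean_score ~ vam + C(district) + C(year)", data=seda).fit(
    cov_type="cluster", cov_kwds={"groups": seda["district"]}
)
print(did.params["vam"], did.bse["vam"])
```

A year-by-year version would replace the single vam indicator with separate dummies for each year since implementation, as in the lower panel of Table 1.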

Granted, this study only looks at four states over five years, so there may well be other state-level factors at play. But these results align with research suggesting that a focus on standardized test scores may not be in the best long-term interest of students. For example, the implementation of VAMs has affected what curricular content teachers stress in the classroom (Coburn et al., 2016; Hursh, 2007) and the amount of time devoted to preparing students and infrastructure for the tests (Hamilton et al., 2007; Plank & Condliffe, 2013; Supovitz, 2009), often to the detriment of a broader curriculum.

Further, the focus on test scores has spurred many teachers to shift their attention to “bubble” students, those closest to the state-defined proficiency level, dedicating less of their energy to higher- or lower-achieving students (Bae, 2018; Coburn et al., 2016; Hamilton et al., 2007; Price, 2016). But many teachers fundamentally misunderstand the VAMs used in their evaluations (Jennings & Pallas, 2016). State evaluation systems use VAMs that measure student growth year over year, not the percentage of students who score “proficient” or better (Close et al., 2020). Those teachers may in fact be better off focusing on students with lower scores, where more growth is possible. Trying to “game” VAM calculations is not only against the interest of the teachers themselves; it also fails to promote academic growth for all students, which is the goal of VAMs.
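
A small, purely hypothetical example makes the distinction concrete: a proficiency rate counts only students who cross the cut score, while a growth measure credits gains anywhere in the score distribution. The student names, scores, and cut score below are invented for illustration.

```python
# Hypothetical scores illustrating why a growth measure differs from a proficiency rate.
prior   = {"Ana": 0.48, "Ben": 0.30, "Cal": 0.75}   # last year's scores
current = {"Ana": 0.52, "Ben": 0.45, "Cal": 0.76}   # this year's scores
cut = 0.50                                           # hypothetical proficiency cut score

proficiency_rate = sum(score >= cut for score in current.values()) / len(current)
mean_growth = sum(current[name] - prior[name] for name in prior) / len(prior)

# Only "bubble" student Ana crosses the cut score, yet Ben's gain contributes far
# more to a growth-based measure than Ana's does.
print(f"proficiency rate: {proficiency_rate:.2f}, mean growth: {mean_growth:.3f}")
```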

There are other concerning unintended consequences of the proliferation of VAMs. For one, high-stakes use of VAM scores discourages quality teachers from taking positions in high-needs districts (Johnson, 2015). There is also the “revolving door” (Ingersoll, 2001, p. 11) of high teacher turnover prompted by an environment of high-stakes employment decisions based on VAMs. Unsurprisingly, this phenomenon is felt more keenly in urban schools serving low-income students of color (Johnson, 2015).

Ultimately, VAMs are only as useful as the employment decisions and improvements in teacher practice that result from them. If school leaders and teachers do not use VAM data toward those ends, then VAMs are little more than added stress for teachers and paperwork for administrators and state education agencies. Of particular interest is how the negative association between VAM implementation and student achievement becomes less significant, or disappears entirely, year over year. It is quite possible that, over time, teachers focus less on the high stakes attached to the tests and more on better teaching, which takes time to show appreciable effects on student learning. In terms of personnel decisions, low VAM scores rarely lead to immediate termination. In Michigan, for example, it takes three years of ineffective evaluations before teachers can be terminated, and their replacements are often first-year teachers who need time and experience to realize their potential (Stegall, 2014).

Twelve years after RTTT and eight years after the proliferation of VAMs, a broad-based examination of VAMs in theory and practice is appropriate. Moving forward, researchers can undertake similar studies including even more states and academic years, although disruptions to test administration during the COVID-19 pandemic have made that task more difficult. But if VAMs are going to continue to inform instructional practice and personnel decisions, it is important to ensure they are functioning as intended.

Joseph Elefante is a PhD student in Educational Leadership Policy at Texas Tech University. He holds an M.A. in Educational Leadership from Montclair State University and a B.Mus. in Music Performance from New Jersey City University. His research interests primarily center on arts and whole child education advocacy and holistic methods of measuring student success, teacher evaluations, and school quality. He is also interested in investigating the long-term outcomes of arts and noncognitive skill development and developing methods of measuring quality arts and noncognitive education in school settings. In addition to his doctoral studies, Joe is currently the Supervisor of Fine & Performing Arts, Family & Consumer Science, and Technology Education for a mid-size urban school district in northeastern NJ. 

References

Aldeman, C. (2017). The teacher evaluation revamp, in hindsight. Education Next, 17(2). https://www.educationnext.org/the-teacher-evaluation-revamp-in-hindsight-obama-administration-reform

Atteberry, A., & Mangan, D. (2020). The sensitivity of teacher value-added scores to the use of fall or spring test scores. Educational Researcher, 49(5), 335–349. https://doi.org/10.3102/0013189X20922993

Bae, S. (2018). Redesigning systems of school accountability: A multiple measures approach to accountability and support. Education Policy Analysis Archives, 26(8). https://doi.org/10.14507/epaa.26.2920

Blazar, D., Litke, E., & Barmore, J. (2016). What does it mean to be ranked a “high” or “low” value-added teacher? Observing differences in instructional quality across districts. American Educational Research Journal, 53(2), 324–359. https://doi.org/10.3102/0002831216630407

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2013). Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood (Working Paper No. 19424; NBER Working Paper Series). National Bureau of Economic Research. http://www.nber.org/papers/w19424

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014). Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593–2632. https://doi.org/10.1257/aer.104.9.2593

Close, K., Amrein-Beardsley, A., & Collins, C. (2020). Putting teacher evaluation systems on the map: An overview of states’ teacher evaluation systems post–Every Student Succeeds Act. Education Policy Analysis Archives, 28, 58. https://doi.org/10.14507/epaa.28.5252

Coburn, C. E., Hill, H. C., & Spillane, J. P. (2016). Alignment and accountability in policy design and implementation: The Common Core State Standards and implementation research. Educational Researcher, 45(4), 243–251. https://doi.org/10.3102/0013189X16651080

Dieterle, S., Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2015). How do principals assign students to teachers? Finding evidence in administrative data and the implications for value added. Journal of Policy Analysis and Management, 34(1), 32–58. https://doi.org/10.1002/pam.21781

Everson, K. C., Feinauer, E., & Sudweeks, R. R. (2013). Rethinking teacher evaluation. Harvard Educational Review, 83(2), 349–370.

Goldhaber, D. D., Goldschmidt, P., & Tseng, F. (2013). Teacher value-added at the high-school level: Different models, different answers? Educational Evaluation and Policy Analysis, 35(2), 220–236. https://doi.org/10.3102/0162373712466938

Hallinger, P., Heck, R. H., & Murphy, J. (2014). Teacher evaluation and school improvement: An analysis of the evidence. Educational Assessment, Evaluation and Accountability, 26(1), 5–28. https://doi.org/10.1007/s11092-013-9179-5

Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J. L., Naftel, S., & Barney, H. (2007). Standards-based accountability under No Child Left Behind: Experiences of teachers and administrators in three states. RAND Corporation. https://www.rand.org/pubs/monographs/MG589.html

Hill, H. C., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794–831. https://doi.org/10.3102/0002831210387916

Hursh, D. (2007). Assessing no child left behind and the rise of neoliberal education policies. American Educational Research Journal, 44(3), 493–518. https://doi.org/10.3102/0002831207306764

Ingersoll, R. M. (2001). Teacher turnover and teacher shortages: An organizational analysis. American Educational Research Journal, 38(3), 499–534. https://doi.org/10.3102/00028312038003499

Iowa Department of Education. (2019). The Iowa model educator evaluation system. https://educateiowa.gov/sites/files/ed/documents/IaMEES.pdf

Jennings, J. L., & Pallas, A. M. (2016). How does value-added data affect teachers? Educational Leadership, 73(8). http://www.ascd.org/publications/educational_leadership/may16/vol73/num08/How_Does_Value-Added_Data_Affect_Teachers%C2%A2.aspx

Johnson, S. M. (2015). Will VAMs reinforce the walls of the egg-crate school? Educational Researcher, 44(2), 117–126. https://doi.org/10.3102/0013189X15573351

Kappler Hewitt, K., & Amrein-Beardsley, A. (2016). Introduction: The use of student growth measures for educator accountability at the intersection of policy and practice. In K. Kappler Hewitt & A. Amrein-Beardsley (Eds.), Student growth measures in policy and practice: Intended and unintended consequences of high-stakes teacher evaluations. Palgrave Macmillan US. https://doi.org/10.1057/978-1-137-53901-4

Michigan Department of Education. (2019). Michigan educator evaluations at-a-glance. https://www.michigan.gov/documents/mde/Educator_Evaluations_At-A-Glance_522133_7.pdf

Minnesota Department of Education. (2013). Minnesota teacher evaluations. https://education.mn.gov/MDE/dse/edev/mod

Opper, I. M. (2019). Value-added modeling 101: Using student test scores to help measure teaching effectiveness. RAND Corporation. https://www.rand.org/pubs/research_reports/RR4312z1.html

Papay, J. P. (2011). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163–193. https://doi.org/10.3102/0002831210362589

Parsons, E., Koedel, C., & Tan, L. (2019). Accounting for student disadvantage in value-added models. Journal of Educational and Behavioral Statistics, 44(2), 144–179. https://doi.org/10.3102/1076998618803889

Plank, S. B., & Condliffe, B. F. (2013). Pressures of the season: An examination of classroom quality and high-stakes accountability. American Educational Research Journal, 50(5), 1152–1182. https://doi.org/10.3102/0002831213500691

Price, H. E. (2016). Assessing U.S. public school quality: The advantages of combining internal “consumer ratings” with external NCLB ratings. Educational Policy, 30(3), 403–433. https://doi.org/10.1177/0895904814551273

Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic achievement. Econometrica, 73(2), 417–458.

Schochet, P. Z., & Chiang, H. S. (2013). What are error rates for classifying teacher and school performance using value-added models? Journal of Educational and Behavioral Statistics, 38(2), 142–171. https://doi.org/10.3102/1076998611432174

Stegall, Y. (2014, September 27). Newer laws make it easier to fire ‘problem’ teachers, keep good ones. https://www.tctimes.com/news/local_news/newer-laws-make-it-easier-to-fire-problem-teachers-keep-good-ones/article_f35cc3e0-4657-11e4-88ef-6b8728b892d1.html

Supovitz, J. (2009). Can high stakes testing leverage educational improvement? Prospects from the last decade of testing and accountability reform. Journal of Educational Change, 10(2–3), 211–227. https://doi.org/10.1007/s10833-009-9105-2

U.S. Department of Education. (2009a). Fact sheet—Race to the Top. https://www2.ed.gov/programs/racetothetop/factsheet.html

U.S. Department of Education. (2009b). Race to the Top program executive summary. https://www2.ed.gov/programs/racetothetop/executive-summary.pdf

Winters, M. A., & Cowen, J. M. (2013). Who would stay, who would be dismissed? An empirical consideration of value-added teacher retention policies. Educational Researcher, 42(6), 330–337. https://doi.org/10.3102/0013189X13496145

Wisconsin Department of Public Instruction. (2019). Wisconsin educator effectiveness system policy guide. https://dpi.wi.gov/sites/default/files/imce/ee/pdf/educator-effectiveness-system-policy-guide.pdf

Wright, S. P., Horn, S. P., & Sanders, W. L. (1997). Teacher and classroom context effects on student achievement: Implications for teacher evaluation. Journal of Personnel Evaluation in Education, 11, 57–67.