Now, I don't claim to have particularly deep expertise in this arena. I am more of a policy generalist and observer who counts among his friends a number of deep-thinking experts in various specialties, whose brains I occasionally pick; that is the extent of my "expertise" in education assessment policy from a statistical perspective. I do not come from an education background. I have a BS in Economics, which means I had my share of upper-level courses in math and statistics in college. I have also been privileged to hold some interesting and unique professional positions from which to watch the evolution of the data accountability model in education, first at the state and local level beginning in the early nineties and then on the national stage, for almost thirty years now.
Over those nearly three decades I have put the pieces together and come to describe the current "growth model" in education circles with the analogy "the Emperor has no clothes." Experts will quietly say this is a terribly flawed way to measure academic growth or to prepare students for life. It is not an assessment tool that gives the teacher in the classroom actionable information, and it is certainly not a good way to judge teachers or schools. But no one is willing to step out and say it. Instead these models are held up as the paragon of education accountability. I wonder why.
There has been too much investment, if that is what you want to call it, and there is currently too much money at stake, to deviate from the current course, no matter how flawed. Hundreds of millions, if not billions, of dollars are on the line nationwide. So I am saying it: the growth modeling of today is terribly flawed, and its unintended negative consequences are large, falling hardest on teachers and schools, and ultimately on students.
To all my friends in the education policy, data and assessment world, please correct me where I might be off the mark. I don't think that I am.
The idea of "measuring" academic growth, as opposed to measuring grade-level "academic proficiency" against a set of standards, is not new and is fairly simple. It even makes some sense on paper. For example, if a teacher inherits a classroom of fifth-grade students who are several grade levels behind relative to their age, how or why should that teacher be measured and judged for effectiveness by a standardized test that only measures proficiency at the fifth-grade level? In this hypothetical, the entire class, being so far behind already, would not be able to pass a fifth-grade proficiency test at the beginning of the year and likely not even at the end, no matter how good the teacher. But if it can be demonstrated that the class overall advanced from three grade levels behind to just one or two grade levels behind during one academic year, while still not at grade-level proficiency, that teacher should be rewarded and celebrated for producing "academic growth" even though the class is not at grade level. Makes sense, right? Quality teachers do this frequently without much recognition. I am all for rewarding that type of success in the classroom.
Well, as the saying goes, "the devil's in the details".
I often ask my statistician and psychometrician friends in the education assessment world this question about "academic growth": "If you were to design an assessment system from the ground up, measuring academic growth at the individual student level, would it in any way resemble what is now being sold and practiced across the country as 'growth models'?" After the inevitable eye roll, chuckle or hearty laugh, the answer is a resounding "NO."
So, why do we currently have today’s so-called academic growth models that are being utilized all across the country to judge teachers, schools and districts?
I might have an answer by peeling back the onion on at least one highly touted and utilized "growth model" by going back in time. I will argue it all goes back to money, and lots of it. An entire multibillion dollar industry has grown up around "academic accountability" and assessments in recent years and there are strong forces that work hard to keep it that way. To use an old term with a twist, let's call it the "Education Industrial Complex". By the way, it is a little known but true fact that overall spending on K-12 education in this country is higher than the United States National Defense budget. Think about that the next time someone argues it is a matter of spending more money on education. This country is deeply committed to education at every level, but I digress.
My first deep exposure to public policy in general, and where I first cut my teeth on education policy and reform, including academic assessments, was in the early nineties as a young freshman legislator in Tennessee. In hindsight it was a very interesting place to be, as Tennessee was gaining a national reputation for annual state testing of students and for retaining the results in a database that by then covered several years. Surely this test data could be put to good use in some manner to study and improve our education system, right? That was the thinking at the time. It's data, after all, and the entire world was moving toward data analysis and utilization, right? Surely it had some value. But what value would a database of static test scores really have? What information could it provide to improve education or the quality of teaching? In reality, not much without some tinkering, manipulation and data modeling.
In 1992, as a legislative response to a very significant lawsuit filed and won against the state of Tennessee by small rural school districts seeking equitable funding, Governor Ned Ray McWherter pushed forward legislation to address the court decision by restructuring the state funding model to level the financial playing field.
States have primary constitutional responsibility to provide education for all children. The U.S. Constitution is silent on the matter but literally every state constitution includes language protecting education as a right and not just a privilege. As a result of this and similar lawsuits across the country, states began to play a much larger role in education than in the past when local school districts had more autonomy for curriculum, instruction and results.
The 1992 bill, which established the Tennessee Basic Education Program (BEP), also included a variety of education reforms intended to improve instruction, increase teacher pay, and so on. It was a comprehensive and much-debated effort to improve public education in the state. Some of the discussion revolved around how to make good use of all the accumulated test data sitting on a server at the Tennessee Department of Education. Over the years the state had invested heavily in this testing program and in gathering test data. It was already a big investment, but how could it pay off?
During and before the 1992 legislative session, along came a gentleman from the University of Tennessee by the name of William L. Sanders, PhD. Dr. Sanders was a statistician at the UT School of Agriculture whose job, as the story was anecdotally told, was to count the number of flies on cattle using mathematical algorithms. I am not sure how accurate that is, but it made a good story.
Dr. Sanders courted Tennessee policymakers and argued that he knew how to bring new value to the multiple years of K-12 test data at the TN Dept of Ed by creating and applying an algorithm to it. That got attention.
Dr. Sanders testified numerous times before education committees that year and ultimately proposed to use statistical predictive modeling on the accumulated test data to demonstrate the effectiveness of teachers in the classroom. In layman's terms, he argued that he could take the test data of a cohort of students (e.g., a classroom), especially if there were multiple years of data on that same grouping of students, plug in a number of weighted variables such as socioeconomic background, age, gender, race, etc., and build an algorithm that could "predict" how that same set of students should perform on future state assessments, given good-quality instruction. By measuring this prediction against the actual next test, he argued, the analysis would demonstrate the effectiveness, or lack thereof, of that year's teaching for that teacher, in that classroom, with a static sampling of students. Those results could then be used as a basis for teacher merit pay and the like. The algorithm he created to make this prediction was, and is, a highly guarded secret, known only to Dr. Sanders and his team.
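The actual TVAAS algorithm has never been published, but the general shape of the approach described above, predicting expected scores from prior scores and background covariates and then treating the gap between actual and predicted as the classroom effect, can be sketched in a few lines. Everything below is my own toy illustration, with invented data and a simple linear model; it is emphatically not the real TVAAS formula:

```python
import numpy as np

# Hypothetical illustration only: the real TVAAS algorithm is proprietary.
# Predict each student's next-year score from a prior score plus a background
# covariate, then average the prediction errors by classroom.

rng = np.random.default_rng(0)
n = 200  # students

prior_score = rng.normal(50, 10, n)          # last year's test score
ses = rng.normal(0, 1, n)                    # socioeconomic index (stand-in)
X = np.column_stack([np.ones(n), prior_score, ses])

# Simulated "actual" next-year scores
actual = 5 + 0.9 * prior_score + 2.0 * ses + rng.normal(0, 5, n)

# Fit the predictive model by ordinary least squares
beta, *_ = np.linalg.lstsq(X, actual, rcond=None)
predicted = X @ beta

# Residual = actual - predicted; this style of model treats the classroom's
# mean residual as the "value added" by that teacher.
classroom = rng.integers(0, 10, n)           # assign students to 10 classes
value_added = np.array([
    (actual - predicted)[classroom == c].mean() for c in range(10)
])
print(np.round(value_added, 2))
```

Even this toy version shows where the arguments start: the "value added" numbers are just averaged prediction errors, so they inherit every weakness of the prediction model and lose reliability as classrooms get smaller.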
Interestingly, this is the same type of algorithmic model that is used to predict the path of hurricanes or other weather patterns. Think about that the next time you hear or see a weather forecast and compare it to actual results.
From a statistical perspective, the key to this model is utilizing the same basic grouping of students from one year to the next in order to have a reliable model. By today's standards, it was pretty rudimentary. But, it allowed the state of Tennessee to make use of the previous “investment” in test scores. Policymakers liked that idea as it could be used as a political argument to say "we are holding schools and teachers accountable for the millions of dollars being spent in Tennessee to educate your children. We now have a tool to assess teachers and schools and we are making a new additional investment on top of the old investment". At the very least, it would buy them some time.
The legislature and Governor agreed with Dr. Sanders and thus was born what is now known as the Tennessee Value Added Assessment System (TVAAS). It was created as part of BEP and was regarded nationally as a significant step forward in the world of education accountability to be able to evaluate classroom teaching, not that anyone really understood how it worked or how the algorithm was structured or applied. It became a matter of faith that it was valid and reliable. That should have been a red flag but it was not.
The TVAAS predictive model was put in place in Tennessee and became a tool for districts to measure teacher effectiveness and quality of teaching for the remainder of the decade. I am sure there were some kinks and tweaks over those years but the basic model remained intact and it began to be used by districts to measure teacher effectiveness.
Ten years after the creation of TVAAS, in January of 2002, President George W. Bush signed into federal law one of the most comprehensive education reform efforts in history at the federal level: the No Child Left Behind Act. Technically, NCLB was not new federal law; it was a revision of a law that had been on the books since 1965, the Elementary and Secondary Education Act (ESEA). Arguably, if not officially, ESEA of 1965 was the legislative response to the 1954 Supreme Court case Brown v. Board of Education, and it was passed as part of President Johnson's "Great Society" program. The original purpose of ESEA, and of its newest iteration, was to address the inequities in educational opportunity between the majority white population and the minority population in this country. NCLB focused primarily on closing the achievement gap that existed along racial lines. It required, among other things, that every student in the country be tested in grades three through eight, and once in high school, against a set of grade-level academic proficiency standards customized and adopted by each state, subject to approval by the U.S. Secretary of Education. The overall rationale behind NCLB was that Congress wanted a snapshot of how well the nation's schools were performing in order to justify the billions of federal dollars flowing to the states under ESEA/NCLB. Its intent was to shine a light into the dark corners of education across the country and expose the groups of students who were not being well served by the existing public education systems, students who were simply being passed along from grade to grade without any success or skills being developed. NCLB passed with overwhelming bipartisan support, though it was very controversial, especially within the education community, and it became the law of the land. Interestingly, there was no mention of measuring academic growth under NCLB; growth measurement was simply not allowed under the new law.
In the early 2000s, the use of data as a tool for improvement was relatively young, especially in the K-12 education space. While much data was beginning to be collected and analyzed for a variety of purposes it was pretty much the Wild West in terms of how to properly collect, analyze and use data for improvement of public schools. NCLB began to put some structure and standards in place so the test data could begin to make some sense. Comparisons could begin to be made between states, districts and schools to provide some level of transparency into our systems.
My second deep professional exposure to education policy began when I joined the administration of President George W. Bush in 2002. NCLB had just been signed into law and was being implemented across the country. I had the fancy title of Deputy Assistant Secretary in the U.S. Department of Education, where I spent the better part of three years traveling the country to explain and defend NCLB to state policymakers and education leaders. I met a lot of education leaders across the country and it was a great learning experience. It was during that time that I became aware of the concept of measuring academic growth instead of, or in addition to, simply measuring proficiency against a set of standards, and it took me a little time to wrap my head around it.
Even though states were not allowed to use academic growth as one of their tools for accountability, the concept was interesting. It also made some sense to credit successful teachers for academic growth if it could be constructed properly. There were a handful of states early on that expressed interest in developing an academic growth model to incorporate into their state's accountability framework but federal law did not allow for the growth model and no waivers were being granted by the Secretary of Education, so the idea ended up on a shelf.
I discovered that educators, by and large, were concerned about the accountability model for teachers under NCLB and rightly so, as in the example I mentioned at the beginning of this piece. I also heard over and over from educators a similar comment to the effect of "I don't mind testing and data, but I want/need test data that will help me tomorrow in the classroom to inform my instruction, not next year for lesson planning when it may be too late. I need it as a tool to help Johnny read tomorrow, not next year". Those comments stuck with me and it is part of the story.
As assessment technologies have improved over the years since NCLB, so have the policy conversations, especially regarding "growth models". This is where it gets more complicated and interesting, but it is necessary to understand as there are different types of assessments and they are used differently. All tests are not alike.
For example, to measure academic growth at the individual level, the same type of assessment must be administered to a student at three points in time (three, rather than two, for statistical reliability). It is also important to understand the different types of assessments, because they are not all alike and they accomplish different goals.
Here is my breakdown of the different types of assessments from my non-expert perspective.
1) Norm Referenced Tests (NRT). This is the type of test many of us grew up with, administered to everyone in the same grade across several states or the whole country, with all the results placed on a bell curve, preferably at the national level, since the larger the overall sample size, the better the accuracy. Results were generally delivered to test takers as a percentile or quartile ranking. This is probably the oldest model of education testing on a broad scale; the old Iowa Tests of Basic Skills come to mind.
2) Criterion Referenced Tests (CRT). This is the type of test required by NCLB for all statewide education assessments. These tests are specifically designed to measure against a set of grade-level standards to determine proficiency. CRTs came into favor leading up to the passage and implementation of NCLB, which then made them a national requirement. They are typically multiple choice in structure. Every state developed its own set of academic standards by grade level, and then a CRT was administered to measure against those standards, with results reported to the state at the school and district levels.
3) Computer Adaptive Tests (CAT). In the simplest of explanations, CATs are administered on a computer and are iterative in nature. When a correct answer is selected, the next question is a little harder; when the answer selected is wrong, the next question is a little easier. These assessments are more sophisticated than NRTs or CRTs, but they can also provide faster and superior results for the teacher to inform instruction. If administered properly, the teacher in the classroom can quickly identify deficiencies and/or strengths for individual students and adjust instruction accordingly, in near real time. These assessments are best suited to measuring academic growth from one point in time to another.
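The adaptive loop just described can be sketched in a few lines. This is my own minimal illustration with an invented fixed-step rule; real CATs select items using item response theory and far more sophisticated ability estimation:

```python
# Rough sketch of the adaptive loop behind a CAT (a toy illustration;
# real CATs use item response theory, not a fixed step size).

def run_cat(answers_correctly, items=20, start=0.0, step=0.5):
    """Walk difficulty up after a correct answer, down after a miss.

    answers_correctly(difficulty) -> bool simulates the student.
    Returns the final difficulty level as a crude ability estimate.
    """
    difficulty = start
    for _ in range(items):
        if answers_correctly(difficulty):
            difficulty += step           # harder next question
        else:
            difficulty -= step           # easier next question
        step = max(step * 0.9, 0.1)      # shrink steps as the estimate settles
    return difficulty

# Simulated student who reliably answers anything at or below ability 1.5
estimate = run_cat(lambda d: d <= 1.5)
print(round(estimate, 2))
```

The point of the sketch is the feedback loop: each response immediately steers the next item, which is why a CAT can locate a student's level quickly and report something actionable the same day rather than months later.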
In the late 2000s, as NCLB approached its 2014 deadline for 100% academic grade-level proficiency, and as assessment technologies advanced in general, more talk and focus began to be placed on the concept of measuring academic growth as a secondary means of assessing teachers, schools and districts.
If you talk with experts who deeply understand these assessments and how each should be applied, and you ask how to measure academic growth, you will likely hear an answer that involves a CAT administered at three points in time spread out over at least a year, measuring progress at the individual level. In other words: administer a CAT at the beginning of the school term (a) in math, reading, etc. to determine exactly where that student stands on the curriculum. Mid-year, administer the same assessment (b) to determine whether things are on track and to make adjustments, and then administer it again at the end of the year (c). The difference (delta) between a and c is the measurement of academic growth for that student.
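The a/b/c scheme reduces to simple arithmetic once the three scores are on the same scale. A tiny sketch, with invented scale scores and a hypothetical expected yearly gain:

```python
# Toy illustration of the a/b/c growth scheme: the same CAT-style scale
# score is taken at three points in the year. All numbers are invented.

SPRING_TARGET_GAIN = 12   # hypothetical expected yearly gain on this scale

def growth(fall, winter, spring):
    """Return (fall-to-spring growth delta, on-track flag at midyear)."""
    delta = spring - fall
    on_track = (winter - fall) >= SPRING_TARGET_GAIN / 2
    return delta, on_track

delta, on_track = growth(fall=198, winter=205, spring=212)
print(delta, on_track)
```

The midyear check (b) is what gives the teacher time to adjust; the a-to-c delta is what an accountability system would ultimately report.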
In 2015, ESEA/NCLB was once again reauthorized and restructured under a new name, the Every Student Succeeds Act. This most recent revision allows states much more flexibility in their state accountability plans, including the possible use of “academic growth models”.
Once again, policymakers looked at the past "investment" in state test data and asked, "How can we make use of all this past test data to create a growth model?"
Of course, the better question that should have been asked was, "How do we best create an academic growth model, and what type of assessment would that require?" If that question was ever asked, it died very quickly.
Again, the rationale became, "We have made such a huge investment already; how can we use what we have and adapt it to measure academic growth down to the student level?" In other words, how can we throw good money after bad rather than going back to the drawing board to rethink and reengineer academic assessments and create a student-level growth model?
By this point in time, entire careers and a multibillion-dollar industry had been built around the CRT model of assessment and around building algorithms to force the square peg of static test scores into the round hole of "academic growth."
In 2000, the algorithmic formula for TVAAS was acquired/bought by a private corporate third party and Dr. Bill Sanders went to work for the company to expand their footprint across the country. The company took the TVAAS formula/model, repackaged it as the Education Value Added Assessment System (EVAAS) and sold the model and their services to states and districts all across the country.
Recognizing the move toward and growing demand for an academic growth model, TVAAS/EVAAS again began to morph into another predictive algorithmic structure. The corporate designers of the new and improved TVAAS/EVAAS model successfully argued that their new algorithms could not only measure academic growth for a cohort of students utilizing a static test score but they also convinced policymakers they could break the data down to a student level to measure academic growth. Really?
Policymakers again bought the argument that the past investment could be leveraged to create an entirely different assessment model and began pouring good money after bad.
In the late nineteenth and early twentieth centuries the makers of horse drawn carriages or buggy whips could not envision improving personal transportation by gas powered motorcars. They tinkered around with the idea and some invested and added room for an extra horse or even a motor to their existing carriages but they could not envision a different and better model because they were so invested in their model. Then, along came Henry Ford. He didn’t build the Model T from a horse carriage. He created a new model that changed the world of transportation.
In more recent times, Steve Jobs didn’t see value in simply upgrading and improving flip phones for more functionality. He asked his engineers to create a touchscreen tablet which became the iPhone and also changed the world.
I think it is time to reengineer the academic growth model from the ground up and stop the madness of adding motors to the horse-drawn carriage, because as far as I can tell, the Emperor has no clothes and it is time to start telling him.