by Dr. Mark H. Shapiro
"Experience is the worst teacher; it gives the test before presenting the lesson." - Vernon Law
Commentary of the Day - April 5, 2006: Stairway to Nowhere. Guest commentary by Poor Elijah (Peter Berger).
There's an old joke about English teachers who grade compositions by tossing them down the stairs. Landing on the first step means an A, and so on down the alphabet. This system offers advantages. First, it's fast. Schools and students can have the results within minutes. Second, it's cheap. All you need is a two-story building. Finally, it's likely no less meaningful than your state's academic assessment system.
Before you get the wrong idea, I'm not against tests. I disagree with The Washington Post columnist who's "never given a test" because he "respects [his] students too much to demean them with exercises in fake knowledge." He boasts that he's spent his career "telling students not to worry about answering the questions." Instead he wants them to "question the answers." This, of course, is difficult to do if you don't know any answers.
I give tests because they help me determine whether my students have learned what I'm trying to teach them, like how to read and why Americans fought the Civil War. I don't consider these things "fake knowledge."
Standardized tests can be helpful, too, provided you remember that questions frequently don't match what a particular class has covered at a particular time. If your class studies division in May, and the test is in April, it'll look like your students can't divide when they just haven't learned how yet. Standardized scores are also increasingly based on subjective judgments by scorers with questionable training. As a result, scores have become less reliable and therefore less meaningful.
Public schools have suffered for decades under the lunatic rule of experts like the Post columnist. Their disdain for content, knowledge, and tests has spawned a backlash obsession with standardized testing.
The result has been an epidemic of assessment fiascoes. In 2003 the National Board on Educational Testing and Public Policy confirmed the "proliferation of undetected human error in educational testing" and predicted more.
The detected errors have been alarming enough. According to The New York Times, scoring errors have forced officials nationwide to recall and void results involving "millions of students," including episodes here in Vermont and a single glitch that affected "250,000 students in six states." In 2004 the Educational Testing Service misscored 27,000 teacher licensing tests. Recently the College Board confessed that 5000 students had received incorrect SAT scores last October. These meltdowns meant that the wrong kids stayed back, qualified freshmen didn't get into college, and 4000 competent teachers couldn't get jobs.
These are the tests that determine whether your school has made the "adequate yearly progress" mandated by No Child Left Behind. These are the data compiled in the charts in your local newspaper, even though a study presented at the Brookings Institution found that "between 50 and 80 percent of the improvement in a school's average test scores from one year to the next was caused by fluctuations that had nothing to do with long-term changes in learning or productivity." A RAND analyst concurred that all the touted testing is identifying "lucky and unlucky schools," not "good" and "bad" schools. A Government Accountability Office report concluded that all the "poor and unreliable" data render comparisons "meaningless."
In Vermont we recently received the results from last fall's statewide tests. My students did pretty well, but the results are nonetheless pretty meaningless. They'd be equally meaningless if my students had done poorly, but then my objections might be considered "sour grapes."
Our state test, the New England Common Assessment Program, is administered in Vermont, New Hampshire, and Rhode Island. I don't doubt that the officials and educators who designed NECAP deliberated long and hard and had good intentions. But that doesn't make the process scientific, or the resulting scores valid and reliable.
For example, students took the tests in October after their teachers had known them for just a few weeks. NECAP nevertheless required us to grade each student based on our "perception" of his or her "demonstrated readiness." After so little time our "perceptions" predictably were little more than guesses. Yet our tentative speculations were a "critical piece" and a "key part of the NECAP standard setting process."
As with most assessments, most NECAP scorers have never been teachers. Scoring was based on ephemeral distinctions like the difference between details that "support the purpose" and those that "mostly support the purpose." Raters next selected pieces exhibiting the worst writing that in their subjective judgment was still "proficient with distinction," "proficient," "partially proficient," and "substantially below proficient." The "cut scores" were based on these determinations, and those arbitrary numbers based on this year's subjective judgments will be the permanent NECAP thresholds for passing and failing in future years.
By the way, when it comes to scoring writing portfolios, Vermont's Department of Education standard for reliability is sixty percent. That means scorers can be wrong almost half the time and still be considered accurate.
The NECAP writing test required students to complete "planning boxes" for brainstorming their "focus, details, and conclusion." Even if the finished essay itself perfectly provided all three, if the boxes weren't filled in satisfactorily, the student's score automatically was lowered by twenty percent.
Officials vetted essay topics in an attempt to eliminate bias against any group. For example, a writing task involving a fire was disqualified by the Bias/Sensitivity Review on the grounds that Rhode Island students might have been traumatized by a nightclub fire that occurred there a few years back. However, this year's extended essay required writers to reflect on babysitting, a subject that is arguably more familiar territory for female students. It might be that both topics are fine, but the process for making these decisions is hardly scientific.
Meanwhile, on the math exam some students could use calculators, while others could not. Despite NECAP's repeated insistence on uniformity, on this critical point each school gets to decide for itself. Even more bizarrely, there is no place to indicate on the test whether calculators were used and no allowance or distinction made when schools' scores are published and compared.
Most statisticians consider groups smaller than thirty, or even sixty, too small for statistical comparisons. NECAP reported scores from classes as small as ten.
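To put rough numbers on why group size matters, here is a minimal sketch, not anything NECAP publishes. It assumes a hypothetical school where 60 percent of students are truly proficient and treats each student as an independent draw, which is a simplification. The function name and the 60 percent figure are illustrative assumptions only.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of an observed proportion p in a group of n students.

    Assumes a simple binomial model (independent students); real classes
    are messier, so treat this as a lower bound on the noise.
    """
    return math.sqrt(p * (1 - p) / n)


# Hypothetical assumption: 60% of students are truly proficient.
P_TRUE = 0.60

for n in (10, 30, 60):
    se = standard_error(P_TRUE, n)
    # Two standard errors is a common rough margin for chance variation.
    print(f"class of {n}: about +/- {100 * 2 * se:.0f} percentage points")
```

Under these assumptions, a class of ten can swing by roughly thirty percentage points on chance alone, and even a class of sixty by more than ten, before anything about teaching or learning has changed.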
District and school scoring results typically fluctuate widely from year to year. In our school, as in many, the teachers don't change, which means the annual variations are due to other factors. Either the students are different each year, which they are, rendering NCLB-mandated comparisons between this year's eighth grade and next year's absurd, or the tests are shamelessly invalid and unreliable. Either way, it makes no sense to rely on those results to rate your school.
Schools typically respond to low scores by conceding their disappointment and voicing clichés about how standardized assessments provide only a piece of the picture. We need to stop giving today's tests even that much credit.
Yes, some schools have genuine problems. Whether or not you like the sound of the word, some students are failing. And I need to be accountable for the job I do in my classroom. But tossing kids' compositions down the stairs wouldn't deserve much notice, analysis, or weight.
Neither do all the charts and meaningless numbers for which we're paying so dearly in time, money, and focus.
© 2006 Peter Berger.
Peter Berger teaches English in Weathersfield, Vermont. Poor Elijah would be pleased to answer letters addressed to him in care of the editor.
The IP comments: Poor Elijah's concerns about the shortcomings of standardized tests are well founded. However, The IP is of the opinion that standardized tests do have some value. For all the reasons that Poor Elijah mentions, they often don't measure what we would like to know about individual students or teachers. And, in many cases they don't tell us much about individual schools. However, they do have value for making more global comparisons. In addition, many of the problems that Poor Elijah raises about the particular test used in the Vermont schools seem to be ones that could be corrected with a little effort.