by Dr. Mark H. Shapiro
"Examinations are formidable even to the best prepared, for the greatest fool may ask more than the wisest man can answer." ... Charles Caleb Colton.
Commentary of the Day - December 5, 2005: This is Only a Test. Guest commentary by Poor Elijah (Peter Berger).
Back in 2002 President Bush predicted "great progress" once schools began administering the annual testing regime mandated by No Child Left Behind. Secretary of Education Rod Paige echoed the President's sentiments. According to Mr. Paige, anyone who opposed NCLB testing was guilty of "dismissing certain children" as "unteachable."
Unfortunately for Mr. Paige, that same week The New York Times documented "recent" scoring errors that had "affected millions of students" in "at least twenty states." The Times report offered a pretty good alternate reason for opposing NCLB testing. Actually, it offered several million pretty good alternate reasons.
Here are a few more.
There's nothing wrong with assessing what students have learned. It lets parents, colleges, and employers know how our kids are doing, and it lets teachers know which areas need more teaching. That's why I give quizzes and tests and one of the reasons my students write essays.
Of course, everybody who's been to school knows that some teachers are tougher graders than others. Traditional standardized testing, from the Iowa achievement battery to the SATs, was supposed to help us gauge the value of one teacher's A compared to another's. It provided a tool with which we could compare students from different schools.
This works fine as long as we recognize that all tests have limitations. For example, for years my students took a nationwide standardized social studies test that required them to identify the President who gave us the New Deal. The problem was that the seventh graders who took the test hadn't studied U.S. history since the fifth grade, and FDR usually isn't the focus of American history classes for ten-year-olds. He also doesn't get mentioned in my eighth grade U.S. history class until May, about a month after eighth graders took the test.
In other words, wrong answers about the New Deal only meant we hadn't gotten there yet. That's not how it showed up in our testing profile, though. When there aren't a lot of questions, getting one wrong can make a surprisingly big difference in the statistical soup.
Multiply our FDR glitch by the thousands of curricula assessed by nationwide testing. Then try pinpointing which schools are succeeding and failing based on the scores those tests produce. That's what No Child Left Behind pretends to do.
Testing fans will tell you that cutting-edge assessments have eliminated inconsistencies like my New Deal hiccup by "aligning" the tests with new state-of-the-art learning objectives and grade-level expectations. The trouble is these newly minted goals are often hopelessly vague, arbitrarily narrow, or so unrealistic that they're pretty meaningless. That's when they're not obvious and the same as they always were.
New objectives also don't solve the timing problem. For example, I don't teach poetry to my seventh grade English students. That's because I know that their eighth grade English teacher does an especially good job with it the following year, which means that by the time they leave our school, they've learned about poetry. After all, does it matter whether they learn to interpret metaphors when they're thirteen or when they're fourteen, as long as they learn it?
Should we change our program, which matches our staff's expertise, just to suit the test's arbitrary timing? If we don't, our seventh graders might not make NCLB "adequate yearly progress." If we do, our students likely won't learn as much.
Which should matter more?
Even if we could perfectly match curricula and test questions, modern assessments would still have problems. That's because most are scored according to guidelines called rubrics. Rubric scoring requires hastily trained scorers, who typically aren't teachers or even college graduates, to determine whether a student's essay "rambles" or "meanders." Believe it or not, that choice represents a twenty-five percent variation in the score. Or how about distinguishing between "appropriate sentence patterns" and "effective sentence structure," or language that's "precise and engaging" versus "fluent and original"?
These are the flip-a-coin judgments at the heart of most modern assessments. Remember that the next time you read about which schools passed and which ones failed.
Unreliable scoring is one reason the Government Accountability Office condemned data "comparisons between states" as "meaningless." It's why CTB/McGraw-Hill had to recall and rescore 120,000 Connecticut writing tests after the scores were released. It's why New York officials discarded the scores from the state's 2003 Regents math exam. A 2001 Brookings Institution study found that "fifty to eighty percent of the improvement in a school's average test scores from one year to the next was temporary" and "had nothing to do with long-term changes in learning or productivity." A senior RAND analyst warned that today's tests aren't identifying "good schools" and "bad schools." Instead, "we're picking out lucky and unlucky schools."
Students aren't the only victims of faulty scoring. Last year the Educational Testing Service conceded that more than ten percent of the candidates taking its 2003-2004 nationwide Praxis teacher licensing exam incorrectly received failing scores, which resulted in many of them not getting jobs. ETS attributed the errors to the "variability of human grading."
The New England Common Assessment Program, administered for NCLB purposes to all students in Vermont, Rhode Island, and New Hampshire, offers a representative glimpse of the cutting edge. NECAP is heir to all the standard problems with standardized test design, rubrics, and dubiously qualified scorers.
NECAP security is tight. Tests are locked up, all scrap paper is returned to headquarters for shredding, and testing scripts and procedures are painstakingly uniform. With one exception: on the mathematics exam, each school gets to choose whether its students can use calculators.
Whether or not you approve of calculators on math tests, how can you talk with a straight face about a "standardized" math assessment if some students get to use them and others don't? Still more ridiculous, there's no box to check to show whether you used one or not, so the scoring results don't even differentiate between students and schools that did and didn't.
Finally, guess how NECAP officials are figuring out students' scores. They're asking classroom teachers. Five weeks into the year, before we've even handed out a report card to kids we've just met, we're supposed to determine each student's "level of proficiency" on a twelve-point scale. Our ratings, which rest on distinguishing with allegedly statistical accuracy between "extensive gaps," "gaps," and "minor gaps," are a "critical piece" and "key part of the NECAP standard setting process."
Let's review. Because classroom teachers' grading standards aren't consistent enough from one school to the next, we need a standardized testing program. To score the standardized testing program, every teacher has to estimate within eight percentage points how much their students know so test officials can figure out what their scores are worth and who passed and who failed.
If that makes sense to you, you've got a promising future in education assessment. Unfortunately, our schools and students don't.
© 2005, Peter Berger.
Peter Berger teaches English in Weathersfield, Vermont. Poor Elijah would be pleased to answer letters addressed to him in care of the editor.
The IP comments: Poor Elijah has identified many of the flaws in the testing requirements of NCLB, and of standardized tests in general. Unfortunately, there are some who essentially want to abandon all testing because of the imperfections found in almost all standardized tests. The IP takes a more measured view: standardized tests have value in educational assessment, but they should not be the only assessments used to judge how much progress an individual student, teacher, school, or school system is making.