What’s the point of standardized tests?
[O]n Wednesday, Barbara Byrd-Bennett, the [Chicago] district CEO, said that she still has big concerns about the [PARCC] test and doesn’t want to administer it to students this spring:
“The purpose of standardized assessments is to inform instruction. At present, too many questions remain about PARCC to know how this new test provides more for teachers, students, parents, and principals than we are already providing through our current assessment strategies.” [from a recent Washington Post article]
No, alas. The purpose of large-scale standardized tests is to audit performance, not be an authentic, transparent, and therefore useful assessment of key goals. Nor, by the way, do current district tests in Chicago or most anywhere else meet her standard. How can any test inform instruction if it stays secure?
This is dismaying to realize, but it has always been thus. Large external tests – including most district and many school tests – are almost never designed to be exemplary feedback. How could they be, if the test remains secure and if the results come back weeks or months later in only a general form? In addition, by relying on proxies for authentic work – because that is what a multiple-choice item is, a proxy – the feedback, such as it is, is practically useless.
Thought experiment. To see this oddity more clearly, imagine if we tested sports teams the way we test students now. Imagine if league championships in soccer or basketball were decided not by “local” games but by district, state, and national tests on secure drill-like “items.” And now imagine that the test gave you no feedback as you shot, i.e., you didn’t know whether each shot – yours or other teams’ – went in. Imagine waiting weeks – until the middle of the next sports season – for the “official results,” yet having no access to either the tasks or the scoring methods used to support those results. Who would learn to be a better ballplayer (or coach) under such a system?
So, it is wrong to claim that mass secure testing provides helpful feedback and accountability. Not when the whole system is a bunch of secrets. (cf. the pending lawsuit against New York State, brought by a teacher whose accountability score cannot be challenged or analyzed; this defies common sense fairness.)
One clear solution: complete openness. A pedagogically sensible solution is to release all tests and the item analysis after the tests are given, so that teachers and students can learn from them. This is what Florida used to do, clearly and elegantly, with its old FCAT, and what Massachusetts still does (though less often). Here is an example of the helpful data provided for each test question on the old FCAT:
The Massachusetts MCAS remains the most open state testing system. You can go to the MCAS site and see results from the last decade (and thus use their tests and analysis for your own purposes locally today):
Ohio offers something more thorough in its K-8 testing (even though few items are released). Look at all this useful information:
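For readers who have never seen such a report, the heart of the “useful information” in these state item analyses is classical item statistics: how many students got each question right, and whether the question separates stronger from weaker students. A minimal sketch (the data, function name, and the top/bottom-thirds method are illustrative, not drawn from any state’s actual reports):

```python
# Classical item analysis sketch: per question, a "difficulty" (proportion
# of students answering correctly) and a "discrimination" index (here, the
# gap in that proportion between the top- and bottom-scoring thirds of
# students). All data below is invented.

def item_analysis(responses):
    """responses: one list of 0/1 item scores per student."""
    n_items = len(responses[0])
    totals = [sum(student) for student in responses]
    # Rank students by total score, then split off top and bottom thirds.
    ranked = [s for _, s in sorted(zip(totals, responses), reverse=True)]
    third = max(1, len(ranked) // 3)
    top, bottom = ranked[:third], ranked[-third:]

    report = []
    for i in range(n_items):
        p = sum(s[i] for s in responses) / len(responses)          # difficulty
        disc = (sum(s[i] for s in top) - sum(s[i] for s in bottom)) / third
        report.append({"item": i + 1,
                       "difficulty": round(p, 2),
                       "discrimination": round(disc, 2)})
    return report

# Example: 6 students, 3 items (1 = correct).
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
for row in item_analysis(scores):
    print(row)
```

A teacher can read such a table at a glance: a very hard item with low discrimination is probably a bad question, while a hard item with high discrimination points at something the class genuinely hasn’t learned. None of this is possible, of course, when the items themselves stay secret.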
Great, you say! No. Because Florida long ago stopped releasing such information on its tests and Massachusetts only releases a few items now, like Ohio.
The culprit? Cost. It is very expensive to release all the test items, and the situation has gotten far worse in recent years: test companies strike deals with states to protect their intellectual property, demanding that few items be released. And state departments of education let them get away with it.
So, any realistic hope of making test results formative is disappearing, to the harm of learning.
An old lament. Here’s what bugs me. Many of us made this argument 20-25 years ago. (Here is my piece from 20 years ago: immorality-of-test-security-wiggins.) The limited value of secret one-shot standardized tests as feedback has been known for decades. They may be acceptable as low-stakes audits; they are wretched as feedback mechanisms and as high-stakes audits. Why don’t audits work when they are high-stakes tests (unlike, say, NAEP or PISA)? Because then everyone tries to “game” them through test prep. This inevitability was discussed by George Madaus and others 40 years ago.
Openness is everything in a democracy. Without such openness, what difference does it make if the PARCC or SB questions offer better tests – if we still do not know what the specific question-by-question results are? There can be no value or confidence in an assessment system in which all the key information remains a secret. Indeed, in some states a teacher can be fired for looking at the test!
PARCC or no PARCC, educators and educational associations should demand that any high-stakes test be released after it is given, supported by the kind of item analysis noted above. We don’t need merely better test questions; we need better feedback from all tests. Fairness, as well as educational improvement, demands it. And PS: the same is true for district tests.
PS: A few people have written asking me why I am calling large-scale M. C. tests “audits” of performance. They are audits in the same way your business is audited, or in the same way you go to your doctor’s office once per year for a physical. Neither the audit nor the physical is the true measure; they are efficient indirect indicators or proxies for the real thing. Your aim is not to get good at the physical or the audit. It’s the other way around: if you or your business is “healthy,” it shows up on the proxy test. That’s why test prep as the only local strategy is dumb: that’s like practicing all year for the doctor’s physical; it confuses cause with effect. The point is for local assessment to be so rigorous and challenging that the kids easily pass the state “physical” once per year.
17 Responses
This frustrated me year after year. In Missouri we get the MAP (Missouri Assessment Program) itemized benchmark descriptors. When these were first released, each question’s descriptor was fairly specific, listing the type of punctuation or the particular reading strategy. After the state introduced Grade Level Expectations, all that was listed was the GLE addressed by the question. Some of the GLEs covered so many topics that it was nearly impossible to figure out what the problem might be. All I could do when analyzing the data was see which GLEs we struggled with, but when it was R3A or R3C (reading fiction/nonfiction) that was no help; those GLEs were huge. We had sample questions and how they were answered, but nothing on the actual test. I haven’t seen the IBDs since the move to Common Core, and I think I’m glad.
A perfect example of my point – a whole lot of so-called analysis that turns out to be as unhelpful as saying: your kids need to read more and read more carefully. Thanks loads!
Yeah, really helpful. Even when analyzing several years’ worth of data I couldn’t find any helpful trends, because the test and the GLEs tested changed each year. As teachers we were specifically told not to look over our students’ shoulders to read the questions.
I have repeatedly inquired of #PARCC representatives and IL leadership about data release. PARCC’s Assessment Blueprints and accompanying Evidence Statements (http://parcconline.org/assessment-blueprints-test-specs) reference meta-data; access to such information would be valuable to teaching and learning, even without full item access.
Please let us know what you learn. Meta-data simply isn’t enough – most people cannot parse it.
I agree, it’s not enough, but it would be better than the nothing typically provided from most large-scale standardized assessments.
Agreed!
Yep. On our “Local Common Assessment,” I get to know a RIT range and a corresponding percentile, and breakdowns in “Algebraic Thinking, Real & Complex Number Systems, Geometry, Statistics and Probability.”
Okay … I’m ready to adjust my teaching for Algebra 1 … What changes should I make?
I see none of the questions, none of the individual responses. I have no idea what kinds of things they considered to be “Algebraic Thinking” nor do I have any sense of what my students might have replied, except for the kids who told me they just clicked at random just to be finished more quickly.
Okay … I’m ready to adjust my teaching for Algebra 1 … What changes are appropriate? Does the kid who scored “LO” really not understand or is she just lazy?
Yeah, that’s the breakdown measurement: LO, AV, HI. Useful?
Then consider that we have our ninth graders taking a test where one of the categories is Real & Complex Number Systems – really? These are 9th graders in pre-algebra and algebra 1 … is the score range of 238-250 based on their less-than-complete knowledge of real numbers combined with no questions on complex numbers, or is that 75th percentile based on questions they would have no reasonable knowledge of?
Okay … I’m ready to adjust my teaching … What changes to my pre-algebra curriculum are appropriate for the kids who scored LO on Algebraic Thinking?
In another class, I have LO, AV, HI … what changes do I make?
How about for the Algebra 1?
I’m not even going to talk about the SBAC or the PARCC or NECAP or NSRE … those are tests in October, scores in April, and there was no way you could trust those scores because of the manipulation of the raw-score conversion tables for continuity reasons. Can’t have a big improvement year to year because “reasons.” The first year of every test has to have results similar to the final year of the test we threw away, so year 1 of the NSRE was first 58% passing, but was re-scored so we only had 30% passing.
If we’re getting rid of a test because it isn’t working appropriately, why do we insist that the new test’s scores match up with the old test’s scores?
Our district admin hasn’t given us a report of NJASK scores in 3 years, nor has there been any discussion about it.
As for district tests, I am now forced to give 5 multiple-choice, lower-order-thinking assessments each year. These are referred to as “benchmark” assessments, to be administered within 3 days of the three other teachers who authored the test. Majority rules in the authoring of test questions, and new and higher-order questions must be argued into existence.
The rationale for the 5 benchmarks is that 8th grade students should be guaranteed a common experience, regardless of their teacher. Agreed. Unfortunately, the tests have been frankensteined together from previous tests; the starting point with common assessments must be a clear articulation of assessment objectives. To date, the objectives have been assigned to the questions after the questions were created, rather than in true UbD spirit, and a majority of teachers have zero training in UbD. The benefit of the data, at least, is that we can in fact use it to inform instruction. But the questions have become a function of what the students of the teacher with the weakest background in assessment practice and science instruction are able to answer.
Not only do I get virtually no feedback on state tests, unless I personally meander through the system of student data, but I now have considerably less say in when and how my students are assessed.
So depressing. You would think that admins would be leading the charge for better transparency and assistance in assessment. And the dreariness of local tests that pander to the lowest common denominator of course undercuts the whole point of testing. This was one of my key points at NJEA on assessment the other day: local common assessments are actually hurting things more than national tests – and your example explains why. My suggestion is to argue for a set of specs & criteria that a common test must meet – at the very least accurately reflect stated course goals. (And if there are no clear course goals, make that Job 1.)
Most admin is leading the charge, but I’m trapped at the lowest level; I have emotional support from a majority of upper admin, but no ACTION. In the ten years I have worked in NJ, I have had 4 superintendents, 2 principals, 4 VPs, 3 curriculum coordinators, and 2 science administrators – no one can see the big picture, not for want of trying.
Agreed that course goals must be Job 1. Working on it. Thanks for the validation.
The other issue with all of the testing is the colossal amount of money we’re spending on this as school systems and as a nation. We have classrooms without enough books to engage kids in real reading, and computer use in some schools is an “event” because there aren’t enough. Meanwhile, tons of public money is going to private companies to create and distribute tests that don’t help improve real learning!
Are we preparing kids for our past or their future?
As someone who works in international education I don’t have to concern myself with US tests, but the contrast between the IB and AP is interesting. With the IB, papers are always released each year, and even the marked student scripts can be returned to the school on request (and for a fee). The IB also gives useful stats on each paper/component. The AP, however, only releases tests occasionally and is generally obsessed with keeping exams secret (an instruction given at the start of each AP exam is that the student should not discuss the contents of the paper with anyone, including their teacher, ever!).
As Grant says, this seems to come down largely to money: the cost of producing new test questions each year etc. However, since the IB can do it I wonder why the College Board can’t?
Doesn’t the IB cost each school a fair amount of money to be involved?
Gymnastics, boxing, figure skating, freestyle anything….
Transparency may be superior, but almost any rating system is often better than no rating system at all.
Agreed! I have often cited these very examples as models for what we should be striving to emulate in academics. See, for example, the ACTFL “I can” statements in foreign language, and my multiple blog posts on track.
I was in a meeting (online) the other day and one of the participants had to go because they were getting ready to test the Kindergarteners! Excuse me?
If pharmaceutical (okay maybe not the best example) and marketing companies only “test” a portion of the population why does education need to test every kid, every year, on everything? Would a sampling suffice?
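The sampling question above has a concrete answer in standard survey arithmetic. A back-of-the-envelope sketch (the margins and the 95% z-value are the usual survey defaults; nothing here comes from the post itself):

```python
import math

# To estimate a proportion (e.g. the percent of students "proficient")
# within a margin of error `margin` at ~95% confidence (z = 1.96), the
# worst-case (p = 0.5) sample size is n = z^2 * p * (1 - p) / margin^2,
# essentially regardless of how large the student population is.

def sample_size(margin, z=1.96, p=0.5):
    """Worst-case sample size for estimating a proportion."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size(0.03))  # -> 1068 students for a +/- 3-point estimate
print(sample_size(0.05))  # -> 385 students for a +/- 5-point estimate
```

In other words, a random sample of roughly a thousand students per grade would give a state a defensible proficiency estimate, which is why NAEP can audit the whole country without testing every child every year.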
Perhaps I’m wrong, but as a Canadian it seems that US teachers never get to really teach. They are constantly prepping for tests, or administering tests, or worrying about tests. Ultimately, even if you get this summative information in a timely manner, you’ve already moved on (to the next test). It’s a bit like getting the weather report a week late when all you had to do at the time was stick your head out the door.
The worst thing (in my mind) about all these tests is they make teachers question their own abilities to assess student learning. I would suspect that most teachers know their students and the tests only serve to confirm what they already know. How much better might it be for learners if educators, with their colleagues (both in school and virtually), talked about improving practice, using random sample test scores as only one very tiny piece of the big picture?