In my previous two posts on standards, I pointed out that though we talk of “a” standard, it would be more accurate to talk about the three different kinds of standards involved in any standard: content, process, and performance. And in Part 2 I talked about the important challenge of figuring out how to anchor our assessment system locally to ensure adequate rigor in the scoring of work, using the example of my daughter’s times in track. I concluded that there is no such thing as “the” anchor: it depends upon your aspirations.
Which samples of student work should serve as anchors? It depends…
So, now consider the academic standard-setting question: which samples of student work should anchor our local performance assessments? If our aim is to ensure that students know where they stand, what anchors should we use and by what process should we establish them? We now know a negative answer: it cannot just be the work of our best students. In some places the work of our best students may be “too” good to use in judging everyone else fairly. More realistically, the current problem in schools is at the other end: the work we tend to give A’s to locally often does not meet “standards” as defined by state and national examples.
Here is a simple illustration of the issue, nicely captured in a report card used until recently in San Diego.
Cool! We see how each student (unnamed) and the entire school is doing against “grade-level” standards, pegged to an objective performance standard – the student’s level of reading comprehension as measured by Degrees of Reading Power (DRP) scores, keyed to specific books of specific difficulty.
Irony: they no longer use it. They went to a “standards-based” report card. But as far as I can tell, the teachers have no anchors or guidance to ensure that the grades they give reflect any specific and consistent standard of performance. So, it is less rather than more likely that students and parents will know how kids are really doing.
How good is this paper?
Writing is more in your face than reading. The evidence is right in front of us; we should be able to say – can this kid write? Not so fast. Let’s consider the difficult challenge of establishing anchors in student writing at the high school level. Students write for their teachers, often for their districts, for their state, and nationally on NAEP and in such programs as the SAT II, AP, and IB. We know that few teachers refer to state or national performance standards when grading; they just grade against personal standards greatly influenced by local school norms – i.e. their best students get A’s and we go from there.
Let’s consider what states do offer. Here is where you can find, for example, anchor papers, rubrics, and commentaries for the state of Tennessee. Here is an 11th-grade paper, written under exam conditions. How competent do you think this performance is, on a 6-point scale?
Here is the rubric:
The state judges gave it a 4. They are saying to all educators in the state that this is a competent piece of writing.
Furthermore, we know how this 4 compares to statewide performance. Here are the results statewide:
More than 85% of the student writers in Tennessee are at the Competent, Strong, and Outstanding levels??? Not credible.
So we ask: how did the anchors get established statewide? Note the second-to-last paragraph on the rubric: 75 teachers are nominated and they choose the anchors. On what basis, we have no idea.
This matters – a lot: we know from NAEP data that Tennessee is at the middle of the pack nationally in state writing performance (we only have 8th grade comparisons here, but there is little reason to expect differences between 8th and 11th grade statewide):
Further, here is what you learn from the NAEP website about state performance standards (albeit in reading, the only ELA area in which such comparisons were made):
Mapping state standards for proficient performance on the NAEP scales showed wide variation among states in the rigor of their standards. The implication is that students of similar academic skills but residing in different states are being evaluated against different standards for proficiency in reading and mathematics. All NAEP scale equivalents of states’ reading standards were below NAEP’s Proficient range; in mathematics, only one state’s NAEP scale equivalent was in the NAEP Proficient range (Massachusetts in grades 4 and 8). In many cases, the NAEP scale equivalent for a state’s standard, especially in grade 4 reading, mapped below the NAEP achievement level for Basic performance. There may well be valid reasons for state standards to fall below NAEP’s Proficient range. The comparisons simply provide a context for describing the rigor of performance standards that states across the country have adopted.
Thus, if a student in a Memphis high school who received a B on the last essay about an assigned text asks, “How am I ‘really’ doing as a writer?” we cannot say. And that’s a shame.
The double problem: a single performance standard that is also merely norm-referenced.
So, as I argued last time, no matter how common the practice, it doesn’t make sense to use one anchor system (no matter how thoughtfully it is established): there is too much variation in institutional and workplace demands across local, regional, state, and national contexts for a single exemplar to be anything other than arbitrary and unhelpful. Nor is it ever wise to use norms instead of defensible standards of performance. Given inherently diverse destinations and aspirations, choosing one anchor of exemplary work might deceive students into getting their hopes up too high or depress them inappropriately about their chances for the next step.
What to do? Honor the truth. The real answer to the question “What is quality work?” is the conditional answer to the question “Can I go where I want to go with these grades?” In other words, the answer is: it depends. Want to go to Harvard? Then here is how your performance stacks up. Want to go to Mercer County Community College? Different standard. Want to go to RISD or another fine art school? Completely different standard. Thus, a student writer may “meet” the standard of “real-world” writing but not be ready for a “good” college. The student may write well enough for a good local college but not an “elite” private college. The quest for an answer to the Goldilocks problem in local assessment that started this three-part series – what’s “just right” as a performance standard? – is illusory.
What we must do instead is help students (and their teachers) know as soon as possible what various IF…THEN… possibilities require. In other words, we need to show them how different institutions would score the same piece of work and what rigor – tasks and scoring standards – is demanded at the higher levels. Otherwise, we set students up for heartache and bitterness. We now see the terrible harm of failing to consider this issue at the local level in “weak” districts where straight-A students graduate from high school with honors – only to need remedial courses in college (or not get into college at all, never mind be ready for well-paying jobs).
This disconnect is even possible in myopic track programs: recall from the earlier blog post that my daughter beat a few “best” runners from other schools by more than 30 seconds. Yet these same runners are #1 in their schools, getting A’s in track! They might easily have unrealistic expectations as runners if their coaches are naïve and if they do not run in regional and state meets. Fortunately, the many larger meets and objective times provide some reality therapy – just as standardized tests do, if we think about it this way.
The failure of the (single) norm-referenced letter grade
Where does this lead, then, in terms of a practical solution? First, to the realization that the traditional single-letter-grade system is an utter failure in a standards-based world. The letter-grade system is clearly incapable of communicating exactly where a student stands relative to any standard, and almost no teacher in American schools knows what every track coach knows: the various performance standards linked to wider-world expectations and aspirations. We only know that some teacher thought the work was “A” or “B” work or whatever. In other words, teachers grade against local norms: they tend overwhelmingly to give A’s to their “best” students and go from there, no matter how strong or weak those “best” students are in absolute terms. So, at the very least, a standards-based report card – separate from letter grades (i.e. local expectations) – is long overdue.
But as the San Diego situation indicates, merely declaring that reporting is “standards-based” – without providing teachers with plenty of anchors for all the standards and oversight of inter-rater reliability – is doomed to fail. Teachers will revert to overly personal, isolated, norm-referenced grading.
A single “standards-based” grade or score is arbitrary, however, as we just saw. There are two key reasons. First, an aggregate score is an arbitrary and opaque averaging across perhaps a dozen different sub-standards, thus depriving everyone of useful feedback. It would be as unhelpful as telling a teacher “you are a B+ teacher,” as if dozens of different sub-skills, attitudes, and understandings need not be reported on. We have no idea what your strengths and weaknesses are as a teacher when it is all reduced to a single grade, no matter how “objective” the grade and anchors used – which is why all the major systems (e.g. Danielson, Marshall, NBPTS) report out against multiple standards and sub-standards.
The second reason a single standards-based score is arbitrary and unhelpful goes back to what we said earlier: which anchor, with what justification, is being used to assign the score, and is it an appropriate standard for that student and goal? If standards vary by institution, then a single standards-based grade, based on a single set of papers, is potentially arbitrary and misleading. What inevitably happens is that grades return to being norm-referenced, not standards-based.
The problem of grades is not, therefore, that we grade work. The problem is giving only one grade against one arbitrary norm.
Thus, a better (counter-intuitive) solution is: multiple grades for the same outcomes, with the outcomes divided into an optimal set of sub-standards. We want to know, and you need to know, your specific strengths and weaknesses in each subject, and a helpful report would translate each judgment into a variety of performance norms and standards, so that you can see where you stand against the varied expectations of the places you might be heading (just as we did with Cilla and track).
For example, in writing, we could give students four evaluations on the same piece of work: a grade against local writing expectations (as we do now), a grade against regional norms, a grade against national standards as epitomized by the SAT or NAEP writing test, and a grade pegged to freshman-level writers at “elite” colleges.
Notice I am not here talking about multiple criteria but multiple standards. Yes, of course, use multiple rubrics to highlight the diverse criteria that make up good writing – e.g. voice, clarity, thoroughness, development, mechanics, etc. But though many schools, districts, and states use multiple rubrics, they still combine the sub-scores into a single composite grade. It is that practice that needs to change. The student needs to know both how they did on the multiple criteria and how they did against multiple performance standards. (Why are we so quick to average and aggregate into one score?)
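To make the distinction concrete, here is a minimal sketch (in Python) of what a report that refuses to aggregate might look like: each rubric criterion for one essay is reported against several performance standards side by side. The standard names and cut-scores are invented placeholders for illustration only, not actual local, state, NAEP, AP, or IB thresholds.

```python
# A sketch of multi-standard reporting: the same rubric sub-scores for one
# essay are judged against several performance standards, and no composite
# grade is ever computed. All standard names and cut-scores below are
# hypothetical placeholders, not actual local, state, NAEP, AP, or IB values.

# Hypothetical standards: the minimum 6-point score needed to "meet" each one.
STANDARDS = {
    "Local district": 3.0,
    "State exam": 3.5,
    "NAEP-like national": 4.0,
    "Elite-college freshman": 4.5,
}

def verdict(score: float, cutoff: float) -> str:
    return "meets" if score >= cutoff else "does not yet meet"

def report(sub_scores: dict) -> None:
    """Print each criterion's score against every standard (no averaging)."""
    for criterion, score in sub_scores.items():
        print(f"{criterion} ({score}/6):")
        for standard, cutoff in STANDARDS.items():
            print(f"  {standard:24} {verdict(score, cutoff)}")

# One essay, scored on several criteria (6-point rubric sub-scores).
report({"Voice": 4, "Clarity": 4, "Development": 3, "Mechanics": 5})
```

The point of the sketch is simply that nothing forces us to collapse these judgments into one letter: the same sub-scores can be carried, untouched, into each standard’s frame of reference, so a “Voice” score that meets the local expectation can honestly be reported as not yet meeting the elite-college one.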
This idea of reporting different scores against different standards is actually easy to do in any high school that uses AP and IB courses as well as state and national assessment information. Writing, for example, could be scored against local, IB, AP, and NAEP standards, using anchors available from those sources.
Various commercial standardized tests offer other possibilities. For example, the ERB Writing Assessment Program (WrAP) reports out student performance against a variety of useful norms: “suburban, independent, and international school stanines and percentiles are provided for each student and class through each of the five levels of the program, based on student performance within each grade.” And we already noted the DRP method of assigning an objective level to student reading ability. Finally, we have, free of charge, all the NAEP anchor papers and rubrics. Why not use them?
In foreign language, the ACTFL proficiency standards (and now the new AP French rubrics) are widely used to frame assessments and to communicate in absolute terms where students are. As a result, programs say things like “Our students will graduate from our program at the Intermediate-Low level or better.” Indeed, this is now how the state of New Jersey organizes its framework, and educators have a set of standards-based rubrics to guide all local assessment.
We would just need to train teachers to give students scores against the relevant anchors, based on relevant destinations, and to organize some regional scoring sessions to help teachers get calibrated. Better yet, we could invite representatives from local colleges and employers to help us calibrate the scoring. I know of a number of high schools that have made exit-level writing tests mirror freshman entry standards for placement in local colleges, and vocational teachers have long anchored their scoring in professional standards and used people from the trades to help them validate the grading. Once teachers are trained, it would be a simple matter of providing them with packets containing work samples and “translations” of grades and scores for the varied contexts.
States should therefore report out such tiers of excellence, just as in track, linked (hopefully) to colleges in the state. Instead of just providing arbitrary cut scores and norm-referenced judgments, they should anchor their tiers in valid scores that align with college entrance conditions across the spectrum of colleges.
This is no abstract argument. The failure to get beyond state norms in testing is a disaster in the making, as Arne Duncan and others have been pointing out for years: NAEP (and PISA and TIMSS) scores have shown for years that two-thirds of American students cannot meet the “high” standards of NAEP. States have been setting the bar low for decades.
In short, it is time to have helpful standards-based reporting in a standards-based world. But not a single arbitrary standard, because there isn’t a single standard: standards are contextual, as the argument reminds us. So, state and local educators should generate a rich set of scores related to varied standards of performance, reflective of the diverse aspirations and destinations of our graduates. Then, at the least, students will have the information they need to make wiser and more informed decisions, given their aspirations and performances.