Reliability

March 5, 2014

The blog gods are very fickle. I had almost put the finishing touches on Validity Part 3 – in which I offered a lengthy discussion of the SAT – and then the College Board today announces an overhaul of the test. And interestingly, the chief rationale is validity:

But beyond the particulars, Mr. Coleman emphasized that the three-hour exam — 3 hours and 50 minutes with the essay — had been redesigned with an eye to reinforce the skills and evidence-based thinking students should be learning in high school, and move away from a need for test-taking tricks and strategies. Sometimes, students will be asked not just to select the right answer, but to justify it by choosing the quote from a text that provides the best supporting evidence for their answer…. Instead of arcane “SAT words” (“depreciatory,” “membranous”), the vocabulary words on the new exam will be ones commonly used in college courses, such as “synthesis” and “empirical.”

So, let’s consider reliability, the partner of validity. It’s very easy to confuse validity and reliability, and many articles on testing conflate the two.
Recall we said that the validity question is: does the test measure what it claims to measure? That is, do results on specific questions that sample the vast domain predict or correlate with results on the larger goal or total domain?
What is reliability? The reliability question is different. Whether or not the test is valid, are the scores stable and consistent, with error minimized? Or was this particular score an outlier or anomaly (should the test be taken repeatedly)? In other words, what is the “true” score? It’s the same question we worry about in national polls: what’s the “true” % for and against, mindful of margin of error?
Speaking of the SAT, a great bar bet is to ask people the margin of error on an individual score on the test. If we leave out the writing, you may be very surprised to learn that the margin of error on the SAT is plus or minus 32 points out of 1200. In other words, statistically speaking, if you take the SAT three times, and you get a 560, a 580, and a 600 for your three scores on the Verbal section, these are basically all the SAME score when you factor in margin of error.
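To make the 560/580/600 example concrete, here is a minimal Python sketch. It assumes the plus-or-minus 32 figure can be treated as a simple band around each observed score, which is a simplification and not a formal confidence level. The function names are made up for illustration.

```python
# Margin of error quoted in the post: plus or minus 32 points per score.
MARGIN = 32


def score_band(score, margin=MARGIN):
    """Return the (low, high) band implied by a score and its margin of error."""
    return (score - margin, score + margin)


def common_overlap(scores, margin=MARGIN):
    """Return the range shared by every score's band, or None if there is none."""
    bands = [score_band(s, margin) for s in scores]
    low = max(band[0] for band in bands)    # highest of the lower bounds
    high = min(band[1] for band in bands)   # lowest of the upper bounds
    return (low, high) if low <= high else None


scores = [560, 580, 600]
for s in scores:
    print(s, score_band(s))
print("shared range:", common_overlap(scores))
```

Under this assumption the three bands share the range 568 to 592, so a single underlying score could plausibly have produced all three results.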
As I hinted last time, reliability is a big problem on complex authentic performance. Even people who are reasonably competent at a complex performance – e.g. writing essays – may have scores that vary considerably over time, as any English teacher knows.
This is true even in professional sports where we expect great consistency in performance over time. Here are the National League West baseball standings from last April 30th:

Here are the final standings on September 30th:
Oops: the Rockies went from first to worst. Their first month of baseball was not a reliable “score.”
Note, therefore, that the “test” of a baseball game is as valid as any test can be: the goal is winning baseball, at the major league level. By definition, that means playing the game of baseball against other major league teams. So validity is near perfect. But there is great unreliability in any single game or even a few weeks of games, as last year’s data reveal. Reliability isn’t really established until well into the season of many games where the “true” score of ability of a team reveals itself.
And the same is true of individual hitters. A batter may go 0-4 in one game. Then, a week later he goes 3-4. Is either “score” reliable in the end? No. The batter will likely end up hitting around .250 for the year (i.e. more like 1 for 4, on average).
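A rough sketch of why a single game tells us so little: if we assume (and it is only an assumption) that each at-bat is an independent trial with a .250 chance of a hit, the binomial formula gives the chance of any one-game line. The helper name `prob_hits` is hypothetical.

```python
from math import comb


def prob_hits(hits, at_bats, average):
    """Binomial probability of exactly `hits` hits in `at_bats` at-bats."""
    return comb(at_bats, hits) * average**hits * (1 - average) ** (at_bats - hits)


AVERAGE = 0.250  # the season-long figure used in the post

# Chance of going 0-for-4 in a single game.
p_hitless = prob_hits(0, 4, AVERAGE)

# Chance of getting 3 or 4 hits in 4 at-bats in a single game.
p_three_plus = prob_hits(3, 4, AVERAGE) + prob_hits(4, 4, AVERAGE)

print(f"0-for-4: {p_hitless:.3f}")
print(f"3 or more hits in 4: {p_three_plus:.3f}")
```

Under this simple model the same .250 hitter goes hitless in roughly a third of his four-at-bat games and gets three or more hits in about one game in twenty, so neither line says much on its own.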
The fear over reliability among large-scale test-makers
Well, this is a potentially HUGE problem for test-makers. If Johnny’s “true” score – like that of the hitter – can vary wildly from day to day and test to test, then we cannot have much confidence at all in the results of a single test, can we?
The ugly fact is that the tests are MADE reliable by using redundant and simple questions – and fancy psychometrics – that beg the deeper question: how do we know that academic achievement is a stable thing, especially in a novice? Maybe it’s more like baseball in which, even at high levels, a single performance result is sufficiently unreliable that it is unwise to use it to make a big judgment. Fine, maybe a question like “2 + 5 = ?” yields reliable answers, but it doesn’t follow from such simple and unambiguous questions and their answers that a student’s genuine level of mastery of arithmetic (as a complex performance) is stable from day to day, especially if we were to start using multistep open-ended problems that demand transfer.
Indeed, this idea about margin of error, and thus humility about the meaning of results, is enshrined in the AERA/APA/NCME Standards for Educational and Psychological Testing, and has been for decades: never make a huge decision on the basis of a single test score because of both reliability and validity issues. It violates the ethics of measurement to do so. (Alas, the Standards are being revised and only the older forms are available, and they are no longer available for free. I searched but found none.)
That’s what makes the current one-shot high-stakes test situation in education so untenable. Judgments about students and teachers are being made on the basis of a single result. It’s wrong, and people who know better in the measurement community know it’s wrong, and ought to be up in arms about it.
The irony is that critics of testing typically claim that the tests are not valid. But that’s where they probably go wrong. The better argument concerns the questionable reliability of a single high-stakes score.
It’s no accident that the World Series is best four out of seven. Can’t kids and teachers get a similar shake?
But before we go with pitchforks to state education buildings, ACT and ETS, the same argument applies to YOUR tests and quizzes. What is the margin of error on a 20-question quiz in arithmetic? Most likely the answer is around plus or minus 3 points. So a 14, 17, and 20 are the same score – just as on the SAT.
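One rough way to see where a figure like plus or minus 3 could come from is sketched below. It assumes a simple binomial model (every question is an independent trial with the same chance of a correct answer) and a conventional 95% multiplier of 1.96. That is an illustration only, not necessarily how the figure above was derived, and `quiz_margin` is a made-up name.

```python
from math import sqrt


def quiz_margin(num_questions, proportion_correct, z=1.96):
    """Approximate margin of error, in points, on a quiz score.

    Uses the binomial standard deviation of the number correct,
    sqrt(n * p * (1 - p)), scaled by z (1.96 for roughly 95%).
    """
    std_dev = sqrt(num_questions * proportion_correct * (1 - proportion_correct))
    return z * std_dev


# A student whose typical result is 17 out of 20.
margin = quiz_margin(20, 17 / 20)
print(f"margin of error: about {margin:.1f} points")
```

For a student who typically gets 17 of 20 right, this model gives a margin of a little over 3 points. The model breaks down near a perfect score, where the formula shrinks toward zero, so treat it as a ballpark only.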
In short, before the pot calls the kettle black, let’s look at local assessments carefully for validity AND reliability.
What you can do as a teacher
Reliability is about confidence in scores/grades, where error is minimized. And error is best minimized by using multiple measures, at different times. Although I find the Saxon math books pretty dull and dreary, their quizzes and tests are built upon this idea as well as the research on spaced vs. massed practice. In other words, you don’t just test the content once or twice right after having taught it. That is probably going to lead to very unreliable results in your grade book.
In addition, it’s best to use multiple measures, for both reliability and validity. Vary the format: use multiple-choice, short-answer, oral questioning, and projects, with redundancy on what is assessed. Include student self-assessments. And for every complex performance, use a complementary quiz on the same content.
Curve results on tests, as most college professors do (and should), when the results are out of whack with patterns that have been established.
Avoid overly-precise-seeming scoring systems. Giving a student a 72 on a history paper is poor measurement. Better to give a 3 out of 4 or something similar. If you know statistics, then report the score in a box and whisker or confidence interval form.
PS: The article in the NY Times on the history of the proposed changes in the SAT is MUST reading!

16 Responses

Brenda Ellis says:

March 5, 2014 at 6:37 pm

I sought your blog on the Internet because you were one of the go-to authorities whose ideas we respected and studied to guide our programs when I worked for Region VIII Education Service Center in Texas. The College Board, on the other hand, is focused on its bottom line.

Arpan Chokshi says:

March 5, 2014 at 7:35 pm

Excellent and thought-provoking post as usual. I agree that a few assessments early in the year cannot reliably determine a student’s performance. However, what concerns me about my students is how reliable they are at producing consistent work. In other words, after a couple of months of school I can fairly reliably predict how well most of my students will do on a paper or project (or which students may not do it at all), even when using a rigorous and objective rubric or peer grading to minimize confirmation bias. I know this is true for many other teachers and it bothers me.
What can we do to ensure that students, who turn in reliably low performances on assessments, are able to make real and significant progress?

- grantwiggins says:
  
  March 5, 2014 at 8:45 pm
  
  Well, that’s a different and troublesome issue. It is indeed predictable that some will simply fail to engage and advance, especially in secondary education. But that may be a function of forces out of your control: they simply don’t want to risk ‘playing the game’. Maybe, then, the challenge is to change the grading system so that kids are rewarded for making progress; and so that the assessment system puts the pressure on them to identify challenges they wish to take on from a menu.
  
Jessica H. says:

March 5, 2014 at 10:34 pm

Thanks for this posting.
I’m curious about research or summaries thereof on the influence of time allotted for test-taking on the reliability of standardized tests of various kinds, particularly the SAT/ACT.
While I understand that IQ tests and the Army Alpha test considered time/speed as a vital factor in measuring cognition/ability–as do some theories of intelligence–beyond cost considerations, I’m unclear about why the SAT/ACT can’t or shouldn’t be virtually un-timed (i.e., give more than enough time for completion…say 5-6 hours versus just under 4). Are there studies that show that the time would have virtually no impact on the outcomes for individual or sub-groups of test-takers? Does it have little to no influence on reliability estimates?
Let’s take the redesigned SAT for example: If the redesigned test truly does attempt to approximate real-world thinking and tasks, then doesn’t it make sense to do away with the artificial time constraint? Yes, people in the real world work against deadlines and need to have some degree of automaticity when it comes to applying their skills to novel situations. But usually not with the amount and kind of content students are asked to consider under timed on-demand testing conditions.
I’d be interested in your or anyone else’s thoughts.

- grantwiggins says:
  
  March 6, 2014 at 7:25 am
  
  Great questions, Jessica. At one time I am sure the time constraint was deliberate. It was used to simulate the challenge of working on blue book exams within the 2 or 3 hours typically given back in the day. I recall there was great reluctance on the part of the College Board to do away with time limits for special needs cases, as a result. Then – as I recall – their own data showed that the additional time didn’t matter much, so they relented on special needs. That naturally led to an increase in special needs cases. I recall being told by a faculty member at one prep school that HALF the class was designated special needs – parents started to clamor for the designation.
  The other issue is the very meaning of the word “standardized” test. What most people forget is that this means that the conditions of testing are identical for all takers, not just that they get the same questions. That is key to its validity. If the proctors make local decisions that are not standard, the test could be compromised. This is why there is a strict protocol on how to respond when kids ask the proctors questions.
  I also know that the NAEP writing sample has very good validity – even though it is only 20-30 minutes. Here, again, as in my earlier posts, we see the desire on the part of the tester to have efficient proxies instead of, say, allowing a kid to write over 3 days as in the “real” world. Obviously, the test would potentially be completely compromised if the students were allowed to write over 3 days, as has happened with the so-called college essay.
  But I am unclear on what current ETS data show on your question; I’ll look into it on the College Board site under research.
  
  - Jessica H. says:
    
    March 6, 2014 at 10:36 pm
    
    Thanks.
    Yes, standardized testing refers to the conditions (e.g., time, site) as well as the questions, proctor protocol, etc. But the principle of “more than enough time” can still be applied in a standardized situation. For example, in the case of the paper-based Illinois Teacher Certification Test, examinees are given five hours to complete the test. (Most people complete it in FAR less time…like 2-3 hours, if that.) You simply leave when you’re finished. Side note: Interesting that this “high-stakes” test for teachers is far more permissive with time than the “high-stakes” tests for students!
    I agree that as long as a writing prompt is well-designed, it’s more than possible to capture valid evidence of a student’s writing ability without giving infinite amounts of time. I’m not at all suggesting that students be given the whole day.
    Most importantly, the amount of time should “match” the kind of prompt it is. So, NAEP gives 20-30 minutes, but the prompts are divorced from the analysis of texts that all test-takers have read. They’re in the vacuum of the test-taker’s experience. By contrast, the sample Research Task that Smarter Balanced has developed for grade 9 is designed for 85 minutes (I think), but the students have to synthesize/draw on information they’ve read and viewed (e.g., a short story, a video, and some graphical data).
    
    - grantwiggins says:
      
      March 7, 2014 at 5:49 am
      
      Yes, all those changes are for the better in Smarter Balanced. NAEP, of course, is no-stakes and an intrusion, so I think they deliberately minimize the amount of time taken to ensure good participation from kids and schools.
      
gasstationwithoutpumps says:

March 6, 2014 at 1:30 am

The statement of SAT margin of error is really only true of the middle of the distribution. At the high end (where the selective schools make their decisions) the margin of error is much larger, due largely to the paucity of questions for distinguishing among top students.
Incidentally, I blogged my impressions of the announcement of the new SAT test today at http://gasstationwithoutpumps.wordpress.com/2014/03/05/sat-is-changing-in-2016/

barrylane55 says:

March 6, 2014 at 12:47 pm

Great point, Grant, about the fallacy of a single score on a standardized test being a problem. A while back certain congressmen made an attempt to eliminate or reduce the Head Start programming, claiming that by third grade tests showed that children who were behind had caught up naturally, without early intervention. Fortunately, someone else had a study that showed children who had gone through Head Start were more likely to move up in society economically over a 20 year period. That long range view of children was a
Tests miss so much that is human and yet pretend to judge our human intelligence. Lately, I wonder what would happen if we just got rid of them all. That is an essential question for Pearson, perhaps. Why do we need these narrow instruments?
I think that the sordid history of standardized testing and its attempts to prove or disprove eugenics has evolved into a more benign, simple way of sorting people into winners and losers. Coleman and crew never seem to question that assumption. Without these standardized tests we would have to think of a better, more complete way of judging intelligence. Alternative assessment would be the only alternative and it would therefore be no longer alternative but essential.
(Disclaimer: I have an IQ of 53 because I refused to read the questions in 3rd grade and just filled in the bubbles.)

- grantwiggins says:
  
  March 6, 2014 at 1:30 pm
  
  HA! (on your IQ score). The history of testing is fascinating – a mix of the benign and the evil. As you note, eugenics was a partial motivation of Terman. However, Binet and the SAT founders were motivated by concerns about opportunity and fairness. Yet, no one anticipated how they would come to dominate our lives in this way.
  That’s always been the question for me. Why have we permitted this dominance? Or, put the other way around, in whose interests are such tests? When you consider how weak a predictor the SAT is, for example, why did it become a multi-billion dollar enterprise? Why didn’t the COLLEGES pay for it as opposed to the applicant? Why do state depts. sign deals with the companies that enable the companies to keep and re-use the very same items in other states – while making top dollar on the tests?
  I wrote a fair amount on this 25 years ago. (See Assessing Student Performance and Educative Assessment). But even I still puzzle on why we let outside companies, with little oversight, call the shots year after year.
  
barrylane55 says:

March 7, 2014 at 11:07 am

Ah yes, the portfolio days in Vermont. I remember them well. I was a network leader. Rick Mills would claim that portfolios were working by quoting standardized test scores. Ugh. Even Diane Ravitch quotes test scores to prove her point that high stakes tests don’t work. We are prisoners of this false metric.
True education reform should address this, not simply redecorate the same tests with new questions and increase the pressure on all involved. It’s the WALLS, not the wallpaper, that we need to be rethinking.
I am making a documentary about this called “What are Schools For?” and would love to have your voice in it. I interviewed Noam Chomsky recently and he made the cogent point that they don’t do tests at MIT, the premier research institution in the world: no tests. Tests are OK for narrow stuff, according to NC, but have no purpose in assessing real learning. Money could be so much better spent elsewhere.
My guess is that the whole thing is economically driven by the educational industrial complex, otherwise known as Team Coleman, but these companies could not thrive if the truth about high stakes testing became widespread, and the tide might be turning. Even now there is this scramble to own formative assessment, because of the backlash of parents against these high stakes assessments and the time and energy they take away from student learning.
Thanks for the work you have done to turn the tide.

- grantwiggins says:
  
  March 7, 2014 at 7:26 pm
  
  And Rick then went to NY where he took a completely different path because he felt burned by the portfolios. That whole story is pretty interesting because it underscores the difference between assessment as conceived by teachers and assessment as conceived by policy people. As I know from first hand experience, many teachers chose samples of work that were “interesting” and indicative of “growth” which is not really what Rick wanted or what the samples were supposed to be according to the guidelines. The deal had always been that Rick (and Ross Brewer) could go down the portfolio road IF it was valid enough to satisfy the business and professional community. But teachers ended up implementing a system that reflected their own interests – which led Rick to conclude that, in NY, he wasn’t going to make that mistake again of leaving it to the educators. And so he didn’t.
  
  - barrylane55 says:
    
    March 8, 2014 at 8:37 am
    
    aha now I get it. I was wondering what that transition was about. Business needs seem to always trump educator needs. I created the Vermont Portfolio Institute back then and did workshops around the country. The audience was composed of teachers who loved portfolios and were using them daily and administrators who only wanted to know exactly how to grade them because they were told to by a higher power.
    
    - grantwiggins says:
      
      March 8, 2014 at 8:42 am
      
      Well, I have some sympathy for Rick and the policy people. They felt that the portfolio could satisfy the need for data while enabling teachers to have useful information. But the two purposes collided when teachers started being creative about what pieces to include and how to assess them. I think these kinks could have been worked out, but they never were.
      
Adam Percival says:

March 7, 2014 at 1:10 pm

Hi Grant,
Enjoyed this article, great points! One little thing: I think it’s pretty easy to be misled by the statement that “a 14, 17, and 20 are the same score”. It really only makes sense if we’re talking about a single student taking the assessment multiple times, in that we may not be seeing a real difference in achievement. To put it another way, if you have three students, one who scored a 14, one who scored a 17, and one who scored a 20, the “14” means something very different from the “20”. The “14” student could have a margin of error spanning from 8-14 just as easily as from 14-20, whereas for the “20” student a 14 is the absolute lowest score we’d ever expect within the margin of error… and that’s assuming they’d score 17 on average, rather than something higher.
Another little bone to pick is that the order in which you receive the scores is probably significant… “560, 580, 600” on the SAT means something different than “580, 600, 560” (likelier in the first case that you are improving over time).
So, point being, definitely agree that making major decisions based on a single score at a local or state level is poor decision-making. It’s a solid bet that a student scoring 20 in this example has a good handle on the material, though. So it’s probably more of an issue near some kind of “cut-off” point for decision-making, and maybe re-testing should be focused there.

- grantwiggins says:
  
  March 8, 2014 at 7:06 am
  
  Adam – excellent clarifications all. The order may well matter, and a 20 certainly suggests mastery. My caution is really about relying on a single score. It is a cut-off and confidence-interval issue, which I don’t think teachers have been helped to consider. Put differently, many teachers use a 100-point marking scheme, and SAT scores are treated as sacred – everyone is easily seduced by the false precision of it all.
  
