In the first post in this series, I defined what validity is and why it is trickier to determine than one might think. I also warned that many educators misunderstand how test-makers judge validity, and cautioned against judging validity just by looking at the test question itself. What matters is the link between the question and the goal of the assessment that uses the question (and the pattern of results, as noted in post #1).
In this second post I want to explore the challenging idea mentioned briefly in post #1: A test question can be inauthentic – even somewhat trivial or weird – but still yield a valid inference about meeting a goal or not. And vice versa: a question can by itself be highly authentic as an activity but be completely invalid for the goal it supposedly measures. Let’s start with the 2nd point.
How authentic assessments can be invalid. We regularly tell a now-legendary story in UbD training about something that happened to me in a workshop in Virginia almost 2 decades ago. We were working on designing assessments for understanding goals when a woman came up to me and said:

Can I try an idea out on you?

Sure, I said.

I want to use a learning activity that I love and the kids love – for an assessment; can that work?

Sure, I said – if the activity is a valid reflection of the goal.

Well, that’s what I want your feedback on. My goal is the standard for the War between the States in 5th grade in Virginia’s SOLs: “Students will understand the causes and effects of the Civil War, with an emphasis on military, economic, political, and social history.”

OK, I said – but that’s a big, messy Standard (and for 5th grade, jeesh). What is the activity?

Well, the kids do a diorama of what they believe is the most important battle of the War. Then they serve as docents to their exhibit in a ‘museum’ where they answer questions from other students and explain the battle and its importance. What do you think?

Hmmm. I am not going to tell you. I’m going to put 2 questions on the overhead [this was in the days of acetates!] and ask you to self-assess your proposed design.

The acetate had the following 2 questions on it, which we call our quick-n-dirty 2-question test of validity:

        • Could the students do a great job on the task, but not meet the goal of the assessment?
        • Could the students do a poor job on the task but still provide lots of evidence that they can otherwise meet the goal of the assessment?

If the answer to either question is yes, then the proposed test is likely invalid for that goal.

 Oh, my, she said with an embarrassed look.

Clearly her idea doesn’t work – and she saw it. Yet, that is really GOOD news. It means that with a little self-discipline, we can all better self-assess and self-adjust our assessments for validity.
Do you see why – no matter how educational, fun, and interesting the project was – the assessment is not aligned with the goal being measured?
In fact, the more you think about it, the worse the project looks: a battle is a single event; the Standard asks students to understand the (multiple) causes and effects of the war over time.
A useful follow-up to a 2-question-test ‘failure’ is to ask two further questions:

      • What goals might this project be used to validly assess (if any)?
      • How might the core of the project be saved, to validly assess the original goal?

The project can certainly be used for considering the question: which were the most important battles of the war and why? And it can be used to get at important language arts goals related to explanation and presentation. But it would have to be changed to measure the original goal of showing understanding of cause and effect for the war.
How might the project be tweaked to save it for the goal? Think: a timeline of the war in a museum, instead of a single event. Suppose the task were changed to having students propose a timeline and sample exhibits that would go in a Civil War Museum. What would be on the timeline and why? That is clearly much closer to the goal – and has the added benefit of involving different media for differentiation and validity purposes, without compromising the goal. (You can read lots more about assessment in unit design here and here.)
Confound it! Most psychometricians are wary of performance assessment for reasons that are easy to understand in light of the above story. In a complex performance, we have all sorts of confounding variables. What does ‘confounding’ mean? It means just what the 2-question test is trying to get at: if you can do well or poorly on a test for reasons separate from the thing we are trying to measure, or if we cannot tell why you succeeded or failed, then there are other powerful variables at work in the test and validity cannot be established. Which is not good for testing, even if complex work is good for learning. As I said last time: we have to learn to think like an assessor here.
Recall the problem with the 4th grade essay on the Fed’s monetary policy, mentioned in the previous post. The knowledge of economics is a confounding variable in figuring out if the student can write essays – which was the goal of the test. Let’s apply the 2-question test again:

    • Could the 4th-grade student write a great essay on the Fed’s policy without being able to write well? Probably not, unless the teacher was so overwhelmed by their technical knowledge that she gave a high grade in spite of weaknesses in writing.
    • Could the 4th-grader write a poor essay on the Fed’s policy while still being a good writer? Most definitely yes – hence, it fails at least half the 2-question test.

Assessing reading by writing: problematic. Assessing reading by speaking: problematic. This is why it is vital for teachers of younger students especially to use many media for gauging content knowledge and understanding. An inability to write or speak may be preventing us from seeing how much they really know (and vice versa: very articulate children are often over-rewarded for their content knowledge).
But Grant, we want all students to be good writers!
Of course we do. And if the goal is good writing for a specific assessment, then write they must. But if the goal is “understands the causes and effects of the Civil War,” and only writing is used as the medium of testing, then we have a problem – it will likely fail our simple validity test.
Well, can’t we measure both at the same time? Can’t I assign an essay on a book we have read, for example?
Well, you can, but the knowledge of the text (or lack of it) surely affects the quality of the writing. And vice versa: the knowledge of the book may be outstanding, but the writing so poor that it makes understanding look weaker than we know it to be from talking to the student and observing her. Even if we use separate rubrics for ‘understanding’ and ‘writing’, the problem remains if the writing is so weak as to undercut our grasp of the student’s understanding.
Which, not so coincidentally, is why test-makers love multiple-choice questions and dislike constructed-response and – especially – complex performance tasks. MC questions bypass the problem of the medium of expression (as long as we are allowed to assume that everyone can read at the level of the test question – not always a great assumption) and the confounding variables that are inherent in most performance assessment situations.
That’s why we call simple test questions ‘items’ – it’s like the scientific method: control for one variable and one only – a specific piece of knowledge or skill that (we think) correlates with more complex and messy performance (as in the example in the previous post, of using vocabulary to assess literacy levels). That’s also why measurement people will tell you that good MC tests are far better than other kinds of tests in many instances.
The question of the pineapple. Which, finally, brings us to the (in)famous pineapple questions on the NY State ELA 8th grade test from 2012. Look, it was a bit weird and post-modern for an 8th grade standardized test, for sure. Nor is it at all clear why the test-writers altered the key character to be a pineapple instead of an eggplant, as it was in the original story. And, boy, did the media initially fall all over the story – and mostly get it wrong. (The follow-up NY Times article is probably the most interesting because it also followed the side story of the author getting flamed online by kids for ‘selling out’ to Pearson.)
That said, I think the question measures what it is supposed to measure – reading ability, especially the ability to see the piece as a satire of Aesop and not to read it way too literally (a propensity of immature MS readers). I suspect – especially – that when you look at the results, the best readers got it right for the reasons mentioned. The best readers would get the satire of it and have little difficulty making the best of the choices offered. On this E. D. Hirsch and I agree!
Alas, we won’t know, of course, if the results were valid, since Pearson isn’t saying – and New York threw out the results on those questions anyway.
Perhaps we’ll let the original author of the text used in the test question, Mr. Pinkwater, have the (amusing) last word:

You bet I sold out, I replied. Not to the Department of Education, but to the publisher of tests, useless programmed reading materials, and similar junk. All authors who are not Stephen King will sell permission to allow excerpts from their books to have all the pleasure edited out of them and used this way. You’d do the same thing if you were a writer, and didn’t know where your next pineapple was coming from.

The psychometrician has an unenviable life, alas. Often “eaten at the starting line.”
What should you do to ensure that the inherent messiness of performance assessment doesn’t render evidence and claims about goals invalid? Why should we be suspicious of many validity studies of conventional multiple-choice tests? That’s the subject of the third and final post, next time.


12 Responses

  1. I’m finding these validity explanations very clear and helpful. Thank you!
    It seems that there is a standing tension between assessing and teaching, wherein assessment seeks to limit complexity while teaching seeks to increase it.

  2. Excellent read. Having worked with Jay McTighe last year and Dylan Wiliam this year it has become really clear how much better we can become at using good assessment tools to improve learning.

  3. A couple months ago I created multiple answer tests. (http://educationrealist.wordpress.com/2014/01/17/multiple-answer-math-tests/) I’ve been very happy with them, although they are challenging to grade. Eliminating wrong answers is every bit as much a right answer as selecting right ones, something that is true to a lesser degree in multiple choice tests (which I also like).
    I feel that the NAEP dramatically conflates writing with the subject being tested, for both math and reading. As a result, I’ve always felt it understated reading ability, particularly given its low-stakes nature. But I rarely see any criticism of it on those grounds, and was wondering why.

    • I agree with you on the multiple answer tests: they can reveal a lot of understanding or lack of it. There are even additional wrinkles: as my colleague Nicole has done, you can ask students to identify the answers they are sure are wrong, and also to write a number stating their degree of confidence.
      As for NAEP results, NAEP certainly demands more writing than many tests, but I’d have to look more closely at the items than I have in a while to gauge the problem. Since so many of the questions involve very short answers, and since there is some built-in redundancy with MC questions, I’d be surprised if, on the whole, it was as big a worry as you fear. But I’ll happily look into it further.

  4. I wonder if we should also have more peer review of questions to ensure validity. I think we all get “tunnel vision” and can easily miss the fact that the way we ask the question assumes students can answer it that way to show understanding. Maybe some differentiation of test question formats, with peer review to ensure those questions are relevant and valid, could be the best plan.
    This reminds me of the courtroom saying, “Assumes facts not in evidence.” We assume the students can write, give a speech, create higher-level tech projects, draw, etc. when answering a question or working on a project.

    • This is a central recommendation that I make to schools and districts that want to be mission-based and standards-focused. There HAS to be peer review of the major tests, to ensure validity of the questions and appropriate rigor of the scoring standards. Central to the history of the rise of standardized testing is the lack of validity and reliability of local assessments. This is a point that I have been making, in vain, for 30 years. Quality control in local assessment is the essence of professionalism.
