

Quality in assessment: what makes a good test?

Published on Tuesday, 27 March 2012 13:25

When we think about the quality of a product (a camera, say, or a hair drier), what factors spring to mind? Probably the first thing is whether it does what it's supposed to do, and how well: the camera takes good photographs; the hair drier dries your hair quickly and comfortably. Secondly, you might think of reliability: the camera will take good photographs all the time; the hair drier will dry your hair over and over again without breaking down. In this respect, quality is what gives you confidence in the product.

When it comes to testing we have to ask what it is that tests actually do. The answer is that they give us a measure of someone's ability. From this perspective a good test is one that gives an accurate measure, so we can be sure that someone who obtains a high test score has more ability than someone who obtains a low one. This may sound obvious, but it raises some important questions, most critically: what exactly is the ability we are trying to measure? With language tests, most score users (the people who have to decide whether someone should be admitted to a university course, say, or hired for a job) are interested in someone's ability to use the language in real life: to follow a course of study, to do business with overseas clients, to give information to tourists, to work in an international company. They are not interested in their ability to do tests.

So a good test is one that really tests the ability it claims to measure. This is a complex issue, as many aspects of real-life language ability are difficult or impossible to reproduce in the exam room. However, a good test, especially one for general English, will use a variety of item types to reflect the range of skills and subskills that test takers will need. As well as multiple-choice questions (which provide a simple and reliable way of measuring some aspects of language ability), it will include more productive items such as open-ended questions and, for assessing writing ability, free writing tasks. Take our general English test, PTE General, for example: we include two such tasks, one transactional and the other more expressive, in order to achieve more representative coverage of the types of writing test takers should be able to produce when they use the language in their work or study.

The same holds for speaking. For most real-life purposes it is an important skill, and one which employers are especially interested in. And yet there are tests which purport to give an overall estimate of someone's language ability without testing speaking at all. The reason is simply that testing speaking is difficult. It can't be done with pencil and paper: you need the time and expertise of trained assessors (as with PTE General) or else highly sophisticated technology (as with our academic English test, PTE Academic). Of course, if someone obtains a high score in listening, reading and writing, it is likely that they will have some ability in speaking, but without testing them we cannot know how much. As teachers we can all think of students with an uneven skills profile, strong in some skills and weak in others. As with writing, to assess speaking in PTE General we require test takers to perform a variety of tasks: not only interview and picture-description activities but also, through the role play, showing how they would react in situations that occur outside the classroom.

Reliability is also clearly an essential ingredient of quality for test score users. They want confidence that the expected relationship between test scores and ability will always hold. There are many numerical indicators of reliability in testing. The most basic is test-retest reliability, a measure of what happens when the same individual takes the same test twice (ideally they will obtain the same score).
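To make that last idea concrete, test-retest reliability is usually summarised as the correlation between the two sets of scores. The short Python sketch below illustrates the calculation; the paired scores are invented purely for illustration and do not come from any real test.

from statistics import correlation  # available in Python 3.10+

# Hypothetical scores for the same seven candidates on two sittings of one test.
first_sitting = [42, 55, 61, 48, 70, 66, 53]
second_sitting = [44, 53, 63, 47, 71, 64, 55]

# The test-retest reliability estimate is the correlation between the sittings:
# 1.0 means candidates are ranked identically both times, while values that
# drift well below that would undermine confidence in the scores.
r = correlation(first_sitting, second_sitting)
print(f"Test-retest reliability estimate: {r:.2f}")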

At Pearson we use measures of reliability not only to monitor the quality of our tests in general but also to track the performance of individual items. Items that are shown not to contribute to overall test reliability are withdrawn.

For teachers we can envisage another mark of quality. Students rightly expect their teachers to prepare them for tests, and teachers are well used to undertaking test preparation activities with their classes. Different tests naturally call for different preparation activities. If a test is made up entirely of multiple-choice questions, there will be strong pressure on the teacher to concentrate on constantly repeating these items in class, at the expense of more interesting activities. From this perspective a good test is one that encourages teachers to deploy activities that they find stimulating and that will benefit their students in the long term. This is known as the washback effect: the effect that a test can have on students, and their teachers, before they have even taken it. A quality test should have a positive washback effect, as it will promote learning activities which prepare learners for real language use and not just the immediate demands of the test.

Glyn Jones, Senior Researcher, Pearson Language Testing
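As an illustration of the item-level monitoring described above, one widely used statistic is the corrected item-total (item-rest) correlation: items that correlate weakly or negatively with the rest of the test are the usual candidates for review or withdrawal. The Python sketch below is only a rough illustration of that idea, not Pearson's actual procedure, and the response matrix is invented.

from statistics import correlation  # available in Python 3.10+

# Invented response matrix: rows are test takers, columns are items
# (1 = correct answer, 0 = incorrect answer).
responses = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]

for item in range(len(responses[0])):
    item_scores = [row[item] for row in responses]
    # "Corrected" total: leave the item itself out so it is not correlated with itself.
    rest_scores = [sum(row) - row[item] for row in responses]
    r = correlation(item_scores, rest_scores)
    print(f"Item {item + 1}: item-rest correlation = {r:.2f}")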

What Makes a Good Test?
http://www.princetonreview.com/corporate/what-makes-a-good-test.aspx

It Measures What It Purports to Measure

This should go without saying. However, tests like the SAT prove that this isn't always the case. For many decades, the test was the "Scholastic Aptitude Test," and it purported to measure intelligence. We helped push the College Board to admit that the SAT did no such thing, and the test was renamed in 1994 (fittingly, the acronym "SAT" now stands for nothing at all). The College Board also promoted the idea that the SAT measures high school studies in a way that eliminated differences in grades across schools and classes; The Princeton Review helped show that the test measures almost nothing taught in high school. Finally, the test's advocates claim it predicts college success. In fact, the correlation between college performance and SAT scores is weaker than that indicated by high school transcripts and not much better than family income and other purely socioeconomic indicators.

It's Unbiased

High-stakes tests should be unbiased. That doesn't mean that every demographic group should score equally well; it simply means that similar students should achieve similar scores. Women score 40 to 50 points lower on the SAT, for example, than do men, though they have better grades in both high school and college. Since SAT scores determine both admissions and scholarship/financial aid awards, women are doubly penalized. At the same time, tests can help highlight unequal outcomes, as with the provision in No Child Left Behind that requires states to report scores separately for each subgroup of students (boys and girls, rich and poor, white and nonwhite, etc.). This disaggregated data helps expose systems that are failing the kids most in need.

It's Fair, Open, and Has Reasonable Policies

More than 20 years ago, New York State Senator Ken LaValle promulgated the first "Truth-in-Testing" laws. They required that any admissions test given in New York be subject to independent review of its content and scoring, that test items and scoring be publicly released, and that there be due process for any student accused of cheating or other irregularities. Although the Educational Testing Service (the designers of the SAT) lobbied hard against the laws, claiming they would raise the cost and difficulty of administering the test so greatly that they would no longer be able to give them in New York, no such calamities took place. In fact, the testing companies have since stated that Truth-in-Testing has made their tests better. We believe that all high-stakes tests, including those used for K-12 accountability, should be subject to similar rules.

It Promotes Good Education

High-stakes tests are more than snapshots or benchmarks. They are powerful motivators of whatever behavior, good or bad, will lead most directly to higher scores. You can see this demonstrated when some schools drop recess in favor of narrow drill-and-kill practice sessions, when others provide their teachers with high-quality professional development designed to improve teaching and learning, or by the hard work kids do in our test prep courses to learn skills that are useful only on tests.

Psychometricians tend to focus on a test's accuracy, precision, and reliability. More important than those, though, are the teaching and learning behaviors that the test promotes. The best way to prepare for an essay test is to write a lot; not surprisingly, teachers in states giving essay tests have their students write more. Though an essay test may be no better than a multiple-choice test at assessing writing skills (and somewhat more expensive), it has the inherent advantage of actually getting teachers to get kids to write.

Test Preparation Should Be Stress-Relieving, Not Stress-Inducing

If you give high-stakes tests, people will prepare for them. The question is whether that preparation is efficient, equitable, and effective, and does not distort or disrupt the larger learning context. Our twenty-five years of test prep experience has taught us that, above all, there is a time and a place for it, ideally one that is as minimally intrusive and time-consuming as possible. When we work with schools, we show teachers how to avoid test prep and instead focus on the real work of education, keeping their classrooms from becoming dominated by deadening exercises intended primarily to raise test scores. Assessment should be the tail, not the dog.

From the outset, we've made sure that the fruits of our research and development have been widely available through inexpensive books, free distance learning, and courses underwritten by The Princeton Review Foundation. We work with thousands of schools at every socioeconomic level to help them deal with testing and admissions issues, almost always at a cost savings relative to their existing approaches, and with better outcomes. Finally, we measurably improve performance. We hire outside firms to assess our results, and we participate in every third-party study proposed to us. We encourage our customers to ask us, as well as our competitors, for documentation of any performance claims. Through all of these approaches, and with a consistently honest voice about testing, we've tried to relieve, rather than stimulate, anxiety.
