
Fundamentals of Language Assessment

Christine Coombe & Nancy Hubley

Table of Contents

Introduction
About the Authors
The Cornerstones of Testing
Cornerstones Checklist
Test Types
Test Development Process
Guidelines for Classroom Testing
Writing Objective Test Items
Assessing Writing
Assessing Reading
Assessing Listening
Assessing Speaking
Student Test-Taking Strategies
Statistics
Sample Classroom Statistics Worksheet
Technology for Testing
Internet Resources
Alternative Assessment
Testing Acronyms
Glossary of Important Testing Terms
Annotated Bibliography
Contact Information


Introduction
The past ten years have seen a number of developments in the exam production and evaluation process. Although the specialist testing literature has burgeoned in this decade, information about recent developments and issues has been slow to filter down to the classroom level. This workshop provides an opportunity for teachers and administrators to explore and evaluate aspects of the testing process and discuss current issues. Fundamentals of Language Assessment focuses on the principles of test design, construction, administration and analysis that underpin good testing practice, but this session presupposes no prior knowledge of testing or statistics. Participants will be provided with the essential theoretical and practical background that they need to construct and analyze their own tests or to evaluate other tests.

During the session, basic testing techniques will be covered in a brief presentation immediately followed by practice in small groups. Participants will receive a kit of materials which they will actively explore during the workshop and later take away to guide them when they apply their new skills. In addition to becoming familiar with standard approaches to classroom testing, participants will be introduced to alternative forms of assessment such as self-assessment, portfolio writing, and student-designed tests.

The organizers believe in learning through doing. A key activity in the workshop is reviewing and critiquing tests that exemplify good and bad testing practices. Through this activity, participants will acquire the useful skill of fixing bad test items and salvaging the best aspects of a test. Experience has shown that teachers value something that they can put into practice, something that immediately enhances their skills. We hope that attendees will leave the workshop equipped with a heightened awareness of current testing issues and the means to put them into practice by creating effective classroom tests.


About the Authors


Christine Coombe teaches English at Dubai Men's College and is an Assessment Leader for the UAE's Higher Colleges of Technology. Nancy Hubley formerly worked in these roles for HCT. In 1997, Christine and Nancy founded the Current Trends in English Language Testing (CTELT) Conference, now an annual international event. They frequently provide assessment training and serve as English Language Specialists for the U.S. Department of State. Together, they edited the Assessment Practices (2003) volume for the TESOL Case Studies series. Their co-authored volume entitled A Practical Guide to Assessing English Language Learners was published by the University of Michigan Press in March 2007. Christine is the Past President of TESOL Arabia and the winner of the 2002 ETS Outstanding Young Scholar Award and the Mary Spann Fellowship. Her current research is on test preparation strategies. In 2006 she chaired the TESOL Convention held in Tampa, Florida. In 2001, Nancy received the HCT Chancellor's Award as Teacher of the Year. In 2003, she was part of the writing team for the nationwide Grade 12 textbook in the UAE. She is now based in the US as a freelance materials writer and assessment consultant.


The Cornerstones of Testing


Language testing at any level is a highly complex undertaking that must be based on theory as well as practice. Although this manual focuses on practical aspects of classroom testing, an understanding of the basic principles of larger-scale testing is essential. The guiding principles that govern good test design, development and analysis are validity, reliability, practicality, washback, authenticity, transparency, security and usefulness. Constant references to these important cornerstones of language testing will be made throughout the workshops.

Usefulness
For Bachman and Palmer (1996), the most important consideration in designing and developing a language test is the use for which it is intended. Hence, for them, usefulness is the most important quality or cornerstone of testing. They state that test usefulness "provides a kind of metric by which we can evaluate not only the tests that we develop and use, but also all aspects of test development and use" (p. 17). Bachman and Palmer's model of test usefulness requires that any language test be developed with a specific purpose, a particular group of test takers and a specific language use in mind.

Validity
The term validity refers to the extent to which a test measures what it says it measures. In other words, test what you teach, how you teach it! Types of validity include content, construct, and face. For classroom teachers, content validity means that the test assesses the course content and outcomes using formats familiar to the students. Construct validity refers to the fit between the underlying theories and methodology of language learning and the type of assessment. For example, a communicative language learning approach must be matched by communicative language testing. Face validity means that the test looks as though it measures what it is supposed to measure. This is an important factor for both students and administrators. Other types of validity are more appropriate to large-scale assessment and are defined in the glossary.

It is important that we be clear about what we want to assess and be sure that we are assessing that and not something else. Making sure that clear assessment objectives are met is of primary importance in achieving test validity. The best way to ensure validity is to produce tests to specifications.

Reliability
Reliability refers to the consistency of test scores. It simply means that a test would give similar results if it were given at another time. For example, if the same test were administered to the same group of students at two different times, in two different settings, it should not make any difference to the test taker whether he/she takes the test on one occasion and in one setting or the other. Similarly, if we develop two forms of a test that are intended to be used interchangeably, it should not make any difference to the test taker which form or version of the test he/she takes. The student should obtain about the same score on either form or version of the test.

Three important factors affect test reliability. Test factors such as the formats and content of the questions and the length of the exam must be consistent. For example, testing research shows that longer exams produce more reliable results than very brief quizzes; in general, the more items on a test, the more reliable it is considered to be. Administrative factors are also important for reliability. These include the classroom setting (lighting, seating arrangements, acoustics, lack of intrusive noise, etc.) and how the teacher manages the exam administration. Affective factors in the response of individual students can also affect reliability. Test anxiety can be allayed by coaching students in good test-taking strategies.

A fundamental concern in the development and use of language tests is to identify potential sources of error in a given measure of language ability and to minimize the effect of these factors. Henning (1987) describes a number of threats to test reliability. These factors have been shown to introduce fluctuations in test scores and thus reduce reliability.

Fluctuations in the Learner: A variety of changes may take place within the learner that will either introduce error or change the learner's true score from test to test. Examples of this type of change might be further learning or forgetting. Influences such as fatigue, sickness, emotional problems and practice effect may cause the test taker's score to deviate from the score which reflects his/her actual ability.

Fluctuations in Scoring: Subjectivity in scoring or mechanical errors in the scoring process may introduce error into scores and affect the reliability of the test's results. These kinds of errors usually occur within (intra-rater) or between (inter-rater) the raters themselves.

Fluctuations in Test Administration: Inconsistent administrative procedures and testing conditions may reduce test reliability. This is most common in institutions where different groups of students are tested in different locations on different days.

Reliability is an essential quality of test scores, because unless test scores are relatively consistent, they cannot provide us with information about the abilities we want to measure. A common theme in the assessment literature is the idea that reliability and validity are closely interlocked. "While reliability focuses on the empirical aspects of the measurement process, validity focuses on the theoretical aspects and seeks to interweave these concepts with the empirical ones" (Davies et al., 1999, p. 169). For this reason it is easier to assess reliability than validity.
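A quick way to check this kind of consistency in a classroom setting is to correlate two sets of scores from the same students, for example scores from two administrations or from two parallel forms. The short Python sketch below is illustrative only and is not part of the original workshop kit; the score lists are hypothetical.

# Illustrative sketch: estimating test-retest reliability with a Pearson correlation.
# The two score lists are hypothetical; in practice, use scores from the same
# students on two administrations (or two parallel forms) of the test.

def pearson(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

form_a = [34, 28, 41, 25, 38, 30, 45, 27]   # scores on the first administration
form_b = [36, 27, 40, 28, 37, 31, 44, 25]   # scores on the second administration

print(f"Reliability estimate (test-retest): {pearson(form_a, form_b):.2f}")

A coefficient close to 1.0 suggests the two administrations rank students in much the same way; a low value is a signal to look at the test, administration and scoring factors described above.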

Practicality
Another important feature of a good test is practicality. Classroom teachers are well acquainted with practical issues, but they need to think about how practical matters relate to testing. A good classroom test should be teacher-friendly. A teacher should be able to develop, administer and mark it within the available time and with available resources. Classroom tests are only valuable to students when they are returned promptly and when the feedback from assessment is understood by the student. In this way, students can benefit from the test-taking process. Practical issues include the cost of test development and maintenance, time (for development and test length), resources (everything from computer access, copying facilities and AV equipment to storage space), ease of marking, availability of suitable/trained markers, and administrative logistics. It is important to remember that assessment is only one aspect of a teacher's job and cannot be allowed to detract from teaching or preparation time.

Washback
Washback refers to the effect of testing on teaching and learning. Washback is generally said to be either positive or negative. Unfortunately, students and teachers tend to think of the negative effects of testing such as test-driven curricula and only studying and learning what they need to know for the test. Positive washback, or what we prefer to call guided washback can benefit teachers, students and administrators. Positive washback assumes that testing and curriculum design are both based on clear course outcomes which are known to both students and teachers/testers. If students perceive that tests are markers of their progress towards achieving these outcomes, they have a sense of accomplishment. In short, tests must be part of learning experiences for all involved. Positive washback occurs when a test encourages good teaching practice.

Authenticity
Language learners are motivated to perform when they are faced with tasks that reflect real world situations and contexts. Good testing or assessment strives to use formats and tasks that mirror the types of situations in which students would authentically use the target language. Whenever possible, teachers should attempt to use authentic materials in testing language skills.

Transparency
Transparency refers to the availability of clear, accurate information to students about testing. Such information should include outcomes to be evaluated, formats used, weighting of items and sections, time allowed to complete the test, and grading criteria. Transparency dispels the myths and mysteries surrounding secretive testing and the adversarial relationship between learning and assessment. Transparency makes students part of the testing process.

Security
Most teachers feel that security is an issue only in large-scale, high-stakes testing. However, security is part of both reliability and validity. If a teacher invests time and energy in developing good tests that accurately reflect the course outcomes, then it is desirable to be able to recycle the tests or similar materials. This is especially important if analyses show that the items, distractors and test sections are valid and discriminating. In some parts of the world, cultural attitudes towards collaborative test-taking are a threat to test security and thus to reliability and validity. As a result, there is a trade-off between letting tests into the public domain and giving students adequate information about tests.


Cornerstones Checklist
When developing, administering and grading exams, ask yourself the following questions:

- Does your exam test the curriculum content?
- Does your exam contain formats familiar to the students?
- Does your test reflect your philosophy of teaching?
- Would this test yield the same results if you gave it again?
- Will the administration of your test be the same for all classes?
- Have you helped students reduce test anxiety through test-taking strategies?
- Do you have enough time to write, grade and analyze your test?
- Do you have all the resources (equipment, paper, storage) you need?
- Will this test have a positive effect on teaching and learning?
- Are the exam tasks authentic and meaningful?
- Do students have accurate information about this test?
- Have you taken measures to ensure test security?
- Is your test a good learning experience for all involved?


Test Types
The most common use of language tests is to identify strengths and weaknesses in students' abilities. For example, through testing we can discover that a student has excellent oral abilities but a relatively low level of reading comprehension. Information gleaned from tests also assists us in deciding who should be allowed to participate in a particular course or program area. Another common use of tests is to provide information about the effectiveness of programs of instruction. Henning (1987) identifies six kinds of information that tests provide about students:

- Diagnosis and feedback
- Screening and selection
- Placement
- Program evaluation
- Providing research criteria
- Assessment of attitudes and socio-psychological differences

Alderson, Clapham and Wall (1995) have a different classification scheme. They sort tests into these broad categories: placement, progress, achievement, proficiency and diagnostic.

Placement Tests
These tests are designed to assess students' level of language ability for placement in an appropriate course or class. This type of test indicates the level at which a student will learn most effectively. The main aim is to create groups which are homogeneous in level. In designing a placement test, the test developer may choose to base the test content either on a theory of general language proficiency or on the learning objectives of the curriculum. In the former case, institutions may choose to use a well-established proficiency test such as the TOEFL or IELTS exam and link it to curricular benchmarks. In the latter, tests are based on aspects of the syllabus taught at the institution concerned. In some contexts, students are placed according to their overall rank in the test results. At other institutions, students are placed according to their level in each individual skill area. Elsewhere, placement test scores are used to determine whether a student needs any further instruction in the language or could matriculate directly into an academic program.

Diagnostic Tests
Diagnostic tests seek to identify those language areas in which a student needs further help. Harris and McCann (1994, p. 29) point out that where other types of tests are based on success, diagnostic tests are based on failure. The information gained from diagnostic tests is crucial for further course activities and for providing students with remediation. Because diagnostic tests are difficult to write, placement tests often serve the dual function of both placement and diagnosis (Harris & McCann, 1994; Davies et al., 1999).

Progress Tests
These tests measure the progress that students are making towards defined course or program goals. They are administered at various stages throughout a language course to see what the students have learned, perhaps after certain segments of instruction have been completed. Progress tests are generally teacher-produced and are narrower in focus than achievement tests because they cover a smaller amount of material and assess fewer objectives.

Achievement Tests
Achievement tests are similar to progress tests in that their purpose is to see what a student has learned with regard to stated course outcomes. However, they are usually administered at the mid-point and end of the semester or academic year. The content of achievement tests is generally based on the specific course content or on the course objectives. Achievement tests are often cumulative, covering material drawn from an entire course or semester.

Proficiency Tests
Proficiency tests, on the other hand, are not based on a particular curriculum or language program. They are designed to assess the overall language ability of students at varying levels. They may also tell us how capable a person is in a particular language skill area. Their purpose is to describe what students are capable of doing in a language. Proficiency tests are usually developed by external bodies such as examination boards like Educational Testing Service (ETS) or Cambridge ESOL. Some proficiency tests have been standardized for international use, such as the American TOEFL test, which is used to measure the English language proficiency of foreign college students who wish to study in North American universities, or the British-Australian IELTS test, designed for those who wish to study in the UK or Australia (Davies et al., 1999).

Other Test Types


Objective vs. Subjective Tests
Sometimes tests are distinguished on the basis of the manner in which they are scored. An objective test is one that is scored by comparing a student's responses with an established set of acceptable/correct responses on an answer key. With objectively scored tests, no particular knowledge or training in the examined area is required of the scorer. Conversely, a subjective test requires scoring by opinion or personal judgment. In this type of test, the human element is very important.


Testing formats associated with objective tests are MCQs, T/F/Ns and cloze. Objectively scored tests are ideal for computer scanning. Examples of subjectively scored tests are essay tests, interviews or comprehension questions. Even experienced scorers or markers need moderation sessions to ensure inter-rater reliability.

Criterion- vs. Norm-Referenced or Standardized Tests
Criterion-referenced tests (CRT) are designed to enable the test user to interpret a test score with reference to a criterion level of ability or domain of content (Bachman, 1990). True CRTs are devised before instruction itself is designed so that the test will match the teaching objectives. This lessens the possibility that teachers will teach to the test. The criterion or cut-off score is set in advance. Student achievement is measured with respect to the degree of their learning or mastery of the pre-specified content. A primary concern of a CRT is that it be sensitive to different ability levels.

Norm-referenced (NRT) or standardized tests differ from criterion-referenced tests in a number of ways. By definition, an NRT must have been previously administered to a large sample of people from the target population. Acceptable standards of achievement are determined after the test has been developed and administered. Test results are interpreted with reference to the performance of a given group or norm. The norm is typically a large group of students who are similar to the individuals for whom the test is designed.

High- vs. Low-stakes Tests
High-stakes tests are those where the results are likely to have a major impact on the lives of large numbers of individuals, or on large programs. For example, a test like the TOEFL is high-stakes in that admission to a university program is often contingent upon receiving a sufficient language proficiency score. Low-stakes tests are those where the results have a relatively minor impact on the lives of the individual or on small programs. In-class progress tests or short quizzes are examples of low-stakes tests.


Test Development Process


Planning
- Establish purpose of test
  o place students in program
  o achievement of course outcomes
  o diagnosis of strengths and areas for improvement
  o international benchmark
- Identify objectives
  o operationalize outcomes
- Decide on cutoffs
  o grade, mastery
- Inventory course content and materials
  o consider appropriate formats
  o establish overall weighting
- Scheduling
- Write test specifications

Test Content and Development
- Map the exam
  o decide on sections, formats, weighting
- Construct items according to test specifications
- Establish grading criteria
  o prepare an answer key
- Vet the exam
- Pilot the exam

Before the Test
- Provide information to students
  o coverage, weighting, formats, logistics
- Prepare students
  o student test-taking strategies
  o practice exam activities


Test Administration
- Decide on test conditions and procedures
- Organize equipment needed
- Establish makeup policy
- Inform students about availability of results

After the Test
- Grade tests
  o calibrate if more than one teacher involved
  o adjust answer key if needed
- Compute basic statistics
- Get results to students
  o provide feedback for remediation
- Conduct exam analysis
  o overall exam analysis
  o item and distractor analysis (see the sketch after this outline)
  o error analysis
- Report on exam results
  o channel washback
- Reflect on the testing process
- Learn from each exam
  o Did it serve its purpose?
  o What was the fit with curricular outcomes?
  o Was it a valid and reliable test?
  o Was it part of the students' learning experience?
  o What future changes would you make?
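For the item and distractor analysis step mentioned in the outline above, a simple classical analysis looks at item facility (the proportion of students who answered an item correctly) and item discrimination (how well the item separates stronger from weaker students). The Python sketch below is an illustrative example only; the response data and the upper/lower-third comparison are assumptions made for the demonstration, not a procedure prescribed in this manual.

# Illustrative sketch: item facility and discrimination for a scored objective test.
# Each row is one student's responses, scored 1 (correct) or 0 (incorrect).
# The data are hypothetical.

responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
]

totals = [sum(row) for row in responses]
ranked = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
third = max(1, len(responses) // 3)
upper, lower = ranked[:third], ranked[-third:]

for item in range(len(responses[0])):
    facility = sum(row[item] for row in responses) / len(responses)
    # Discrimination: difference between the upper and lower groups' facility.
    disc = (sum(responses[i][item] for i in upper) / len(upper)
            - sum(responses[i][item] for i in lower) / len(lower))
    print(f"Item {item + 1}: facility = {facility:.2f}, discrimination = {disc:.2f}")

Items with very high or very low facility, or with discrimination near zero (or negative), are the ones to review, rewrite or discard before the test is reused.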


Guidelines for Classroom Testing


- Test to course outcomes
- Test what has been taught, how it has been taught
- Weight exam according to outcomes and course emphases
- Organize exam with student time allocation in mind
- Test one language skill at a time unless integrative testing is the intent
- Set tasks in context wherever possible
- Choose formats that are authentic for tasks and skills
- Avoid mixing formats within one exam task
- Distinguish between recognition, recall, and production in selecting formats
- Design test with entire test sections and tasks in mind
- Prepare unambiguous items well in advance
- Sequence items from easy to more difficult
- Items receiving equal weight should be of equal difficulty
- Write clear directions and rubrics
- Provide examples for each format
- Write more items than you need
- Avoid sequential items
- Take your test as a student before finalizing it
- Make the test easy and fair to grade
- Develop practice tests and answer keys simultaneously
- Specify the material to be tested to the students
- Acquaint students with techniques and formats
- Administer test in uniform, non-distracting conditions
- For subjective formats, use multiple raters whenever possible
- Provide timely feedback to students
- Reflect on exam without delay


Writing Objective Test Items


Multiple Choice Questions (MCQs)
Multiple-choice questions are the hardest type of objective question for classroom teachers to write. Although many people believe MCQs are simplistic, the format can actually be used for intellectually challenging tasks. Teachers should keep the following guiding principles in mind when writing MCQs:

- The optimum number of response options for F/SL testing is four.
- With four response options, one should be an unambiguous correct or best answer. The three remaining options function as distractors.
- Distractors should attract students who are unsure of the answer.
- All response options should be the same length and level of difficulty.
- All distractors should be related in some way (e.g. same part of speech).
- The question or task should be clear from the stem of the MCQ.
- The language of the stem and response options should be as simple as possible to avoid skill contamination.
- The selection of the correct or best answer should involve interpretation of the passage/stem, not merely the activation of background knowledge or verbatim selection.
- Avoid using "all of the above," "none of the above," or "a, b, and sometimes c, but never d" options.
- All response options should be grammatically correct unless error identification is part of your course outcomes.
- Correct answers should appear equally in all positions.
- Make sure there is an unambiguous correct answer for each item.
- As much context as possible should be provided.
- Recurring information in response options should be moved to the stem.
- Avoid writing absurd or giveaway distractors.
- Avoid extraneous clues.
- Avoid sequential items where the successful completion of one question presupposes a correct answer to the preceding question.

Main Idea MCQ Format


The testing of the main idea of a text is frequently done via MCQs. The recommended word count of the paragraph or text itself should be based on course materials. One standard way to test main idea employs an MCQ format with the response options written in the following way:

- JR (just right): This option should be the correct or best answer.
- TG (too general): This distractor relates an option that is too broad.
- TS (too specific): This distractor focuses on one detail within the text or paragraph.
- OT (off topic): Depending on the level of the students, this distractor is written so that it reflects an idea that is not developed in the paragraph or text. For more advanced students, the idea would be related in some way.

Main idea can also be tested via the TFN format by using the "This text/paragraph is mostly about..." prompt.

True/False/Not Given (TFN)


True/False/Not Given questions are a reliable way of testing reading comprehension provided that there are enough questions. They have the added advantage of being easier and quicker to write than MCQs. Teachers should keep the following guidelines in mind when writing TFNs:

- Questions should be written in language at a lower level of difficulty than the text.
- Questions should appear in the same order as the information appears in the text.
- The first question should be an easy question. This serves to reduce test anxiety.
- Avoid using absolutes like "always" or "never" in TFNs.
- Have students circle T, F or N rather than write a letter in a blank.
- To increase discrimination and reduce the guessing factor, add the Not Given option. It means that the information necessary to answer the question is not included in the text.
- Successful completion of TFN items should depend on the students' reading of the text, not on background knowledge.
- Avoid discernible answer patterns for marking (e.g. TTTFFFNNN).
- Avoid verbatim selection, or simply matching the question to words/phrases in the text. Paraphrase questions by using vocabulary and grammar from course materials.
- The TFN format is effectively used to test reading, but should be avoided for listening comprehension.

Matching
Matching is an extended form of MCQ that draws upon the student's ability to make connections between ideas, vocabulary and structure. The advantage over MCQs is that the student has more distractors per item. Additionally, writing items in the matching format is somewhat easier for teachers than writing either MCQs or TFNs. These are some important points to bear in mind:

- Include more items in the answer group than in the question group. Never write items that rely on direct 1-on-1 matching. The consequence of 1-on-1 matching is that if a student gets one item wrong, at least two are wrong by default. By contrast, if the student gets all previous items right, the last item is a process-of-elimination freebie.
- Matching can be used very effectively with related items for gap-fill paragraphs instead of two lists. In this way, students focus on meaning in context and attend to features such as collocation.
- If a two-column format is used for matching, number the questions and letter the answer options. Leave a space for students to write the letter of the chosen answer. This prevents lines being drawn from the question column to the answer column.
- Two-column matching formats should be used sparingly for word association tasks. When this is the specific testing objective, be sure that the syntax between the two columns is correct and unambiguous.
- Avoid extraneous clues such as using "an" when the correct answer starts with a vowel.


Assessing Writing
Most teachers find that it is relatively easy to write subjective test item prompts as contrasted to objective ones. The difficulty lies in clearly specifying the task for the student so that grading is fair and equitable to all students. Some teachers find that the best approach is to write a sample answer and then analyze the elements of that answer. Alternatively, it is useful to ask a colleague to write a sample answer and critique the prompt. Writing good subjective items is an interactive, negotiated process. The F/SL literature generally addresses two types of writing: free writing and guided writing. The former requires students to read a prompt that poses a situation and write a planned response based on a combination of background knowledge and knowledge learned from the course. Guided writing, however, requires students to manipulate content that is provided in the prompt, usually in the form of a chart or diagram.

Guided Writing
Guided writing is a bridge between objective and subjective formats. This task requires teachers to be very clear about what they expect students to do. Decide in advance whether mechanical issues like spelling, punctuation and capitalization matter when the task focuses on comprehension. Some important points to keep in mind for guided writing are:

- Be clear about the expected form and length of response (one paragraph, a 250-word essay, a letter, etc.).
- If you want particular information included, clearly specify it in the prompt (i.e. three causes and effects, two supporting details, etc.).
- Similarly, specify the discourse pattern(s) the students are expected to use (i.e. compare and contrast, cause and effect, description, etc.).
- Since guided writing depends on the students' manipulation of the information provided, be sure to ask them to provide something beyond the prompt such as an opinion, an inference, or a prediction.
- Be amenable to revising the anticipated answer even as you grade.


Free Writing
All of the above suggestions are particularly germane to free writing. The goal for teachers is to elicit comparable products from students of different ability levels. The use of multiple raters is especially important in evaluating free writing.

- Agree on grading criteria in advance and calibrate before the actual grading session.
- Decide whether to use a holistic scale, an analytical scale, or a combination of the two for marking. If using a band scale, adjust it to the task.
- Acquaint students with the marking scheme in advance by using it for teaching, grading homework and providing feedback.
- Subliminally teach good writing strategies by providing students with enough space for an outline, a draft and the finished product.
- In ES/FL classrooms, be aware of cultural differences and sensitivities among students. Avoid contentious issues that might offend or disadvantage students.

Writing Assessment Scales


The F/SL assessment literature generally recognises two different types of writing scales for assessing student written proficiency: holistic marking and analytical marking.

Holistic Marking Scales


Holistic marking is where "the scorer records a single impression of the impact of the performance as a whole" (McNamara, 2000, p. 43). In short, holistic marking is based on the marker's total impression of the essay as a whole. Holistic marking is variously termed impressionistic, global or integrative marking. Experts in holistic marking recommend this type of marking as quick and reliable provided that 3 to 4 people mark each script. The general rule of thumb for holistic marking is to mark for two hours and then take a rest, grading no more than 20 scripts per hour. Holistic marking is most successful using scales of a limited range (i.e. from 0-6).

FL/SL educators have identified a number of advantages to this type of marking. First, it is reliable if done without tight time constraints and if teachers receive adequate training. Also, this type of marking is generally perceived to be quicker than other types of writing assessment and enables a large number of scripts to be scored in a short period of time. Thirdly, since overall writing ability is assessed, students are not disadvantaged by one weaker component, such as poor grammar, bringing down the score.

Several disadvantages of holistic marking have also been identified. First of all, this type of marking can be unreliable if marking is done under short time constraints and with inexperienced, untrained teachers (Heaton, 1990). Secondly, Cohen (1994) has cautioned that longer essays often tend to receive higher marks. Testers also point out that reducing a score to a single figure tends to reduce the reliability of the overall mark. The most serious problem associated with holistic marking is its inability to provide feedback to those involved. More specifically, when marks are awarded through a holistic marking scale, no information or washback on how those marks were arrived at is available. Thus, testers often find it difficult to justify the rationale for the mark. Hamp-Lyons (1990) has stated that holistic marking is severely limited in that it does not provide a profile of the student's writing ability.

Analytical Marking Scales


Analytical marking is where raters provide separate assessments for each of a number of aspects of performance (Hamp-Lyons, 1991). In other words, raters mark selected aspects of a piece of writing and assign point values to quantifiable criteria (Coombe & Evans, 2001). In the literature, analytical marking has also been termed discrete point marking and focused holistic marking. Analytical marking scales are generally more effective with inexperienced teachers, and they are more reliable when the scale has a larger point range.

A number of advantages have been identified for analytical marking. Firstly, unlike holistic marking, analytical writing scales provide teachers with a "profile" of their students' strengths and weaknesses in the area of writing. Additionally, this type of marking remains reliable even when done by inexperienced teachers who have had little training and who grade under short time constraints (Heaton, 1990). Finally, training raters is easier because the scales are more explicit and detailed.

Just as there are advantages to analytical marking, educators point out a number of disadvantages associated with using this type of scale. Analytical marking is perceived to be more time-consuming because it requires teachers to rate various aspects of a student's essay. It also necessitates that a set of specific criteria be written and that markers be trained and attend frequent moderation or calibration sessions. These moderation sessions ensure that inter-marker differences are reduced, thereby increasing the reliability of the marks. Also, because teachers look at specific areas in a given essay, the most common being content, organization, grammar, mechanics and vocabulary, marks are often lower than for their holistically marked counterparts. Another disadvantage is that analytical marking scales remove the integrative nature of writing assessment.
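To make the arithmetic of an analytical scale concrete, the short sketch below combines hypothetical band scores for five commonly used criteria (content, organization, grammar, vocabulary, mechanics) using hypothetical weights. The criteria, weights and scores are illustrative assumptions, not a scale recommended in this manual.

# Illustrative sketch: combining analytical marking criteria into a single score.
# The criteria, weights (out of 100) and band scores (0-5) are hypothetical.

weights = {"content": 30, "organization": 25, "grammar": 20, "vocabulary": 15, "mechanics": 10}
bands = {"content": 4, "organization": 3, "grammar": 3, "vocabulary": 4, "mechanics": 5}
max_band = 5

total = sum(weights[c] * bands[c] / max_band for c in weights)
print(f"Weighted analytical score: {total:.1f} / 100")

Adjusting the weights is one way to make the scale reflect course priorities, for example giving content and organization more influence than mechanics.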

Selecting the Appropriate Marking Scale


Selecting the appropriate marking scale depends upon the context in which a teacher works. This includes the availability of resources, amount of time allocated to getting reliable writing marks to administration, the teacher population and management structure of the institution. Reliability can be increased by using multiple marking, which reduces the scope for error that is inherent in a single score.

Writing Moderation/Calibration Process


For test reliability, it is recommended that clear criteria for grading be established and that rater training in using these criteria take place prior to marking. The criteria can be based on holistic or analytical rating scales. However, whatever scale is chosen, it is crucial that all raters adhere to the same scale regardless of their personal preference. The best way to achieve inter-rater reliability is to practice. Start early in the academic year by employing the marking criteria in non-test situations. Make students aware from the outset of the criteria and expectations for their work. Collect a range of student writing samples on the same task and have teachers evaluate and discuss them until they arrive at a consensus score. Involve students in peer-grading of classroom writing to familiarize them with marking criteria. This has the benefit of making students more aware of ways in which they can edit and improve their writing.
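A simple way to monitor how well raters are calibrated is to compare the marks two raters give to the same set of scripts and count how often they agree exactly or within one band. The sketch below is illustrative only; the scores are hypothetical, and exact/adjacent agreement is just one convenient check, not a procedure prescribed here.

# Illustrative sketch: exact and adjacent agreement between two raters
# marking the same scripts on a 0-6 holistic band scale. Scores are hypothetical.

rater_1 = [4, 5, 3, 6, 2, 4, 5, 3, 4, 5]
rater_2 = [4, 4, 3, 5, 2, 5, 5, 3, 3, 5]

exact = sum(a == b for a, b in zip(rater_1, rater_2))
adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater_1, rater_2))

n = len(rater_1)
print(f"Exact agreement:  {exact / n:.0%}")
print(f"Within one band:  {adjacent / n:.0%}")

Low agreement is a signal that another moderation session is needed before live marking begins.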

Recommendations for Writing Assessment


As always, assessment should first and foremost reflect the goals of the course. In order for writing assessment to be fair to students, they should have plenty of opportunities to practice a variety of different writing tasks of varying lengths. In other words, tests of writing should be shorter and more frequent, not just a "snapshot" approach at midterm and final exams.


Assessing Reading
Most language teachers assess reading through its component subskills. Since reading is a receptive language skill, we can only get an idea of how students actually process texts through techniques such as think-aloud protocols. It is not possible to observe reading behavior directly. For assessment, we normally focus on certain important skills which can be divided into major and minor (or contributing) reading skills.

Major reading skills include:
- Reading quickly to skim for gist, scan for specific details, and establish the overall organization of the passage
- Reading carefully for main ideas, supporting details, the author's argument and purpose, the relationship of paragraphs, and fact vs. opinion
- Information transfer from nonlinear texts

Minor reading skills include:
- Understanding at the sentence level: syntax, vocabulary, cohesive markers
- Understanding at the inter-sentence level: reference, discourse markers
- Understanding the components of nonlinear texts: the meaning of graph or chart labels and keys, and the ability to find and interpret intersection points

It should be noted that the designations "major" and "minor" largely relate to whether the skills pertain to large segments of the text or whether they focus on certain local structural or lexical points. Increasingly, grammar and vocabulary are contextualized as part of reading passages instead of being assessed separately in a discrete-point fashion. However, there are times when it is appropriate to assess structure, vocabulary, and language-in-use separately.

Reading texts include both prose passages and nonlinear texts such as tables, graphs, schedules, advertisements and diagrams. Texts for assessment should be carefully chosen to fit the purpose of assessment and the level of the students, taking factors such as text length, density and readability into account. For assessment, avoid texts with controversial or biased material because they can upset students and affect the reliability of test results. Ninety percent of the vocabulary in a prose passage should be known to the students (Nation, 1990).

Reading tests use many of the formats already discussed. Recognition formats include MCQs, TFNs, matching and cloze with answers provided. If limited-production formats such as short answer are used, the emphasis is usually on meaning, not spelling. Of course, there will be authentic tasks such as reading directions for form-filling where accuracy is important.

Specifications
As with all skills assessment, it is important to start with a clear understanding of program objectives, intended outcomes and target uses of English. Once these are clear, you can develop specifications or frameworks for developing assessment. Specifications will clearly state what and how you will assess, what the conditions of assessment will be (length and overall design of the test), and will provide criteria for marking or grading. Here are typical features of specifications:

Content
What material will the test cover? What aspects of this material? What does the student have to be able to do? For example, in reading, perhaps a student has to scan for detailed information. With regard to reading passages, specifications state the type of text (prose or nonlinear), the number of words in the passage and the readability level. Acceptable topics and the treatment of vocabulary are usually set forth in specifications. For instance, topics may be restricted to those covered in the student book and vocabulary may focus on core vocabulary in the course.

Conditions
Specifications usually provide information about the structure of the examination and its component parts. For example, a reading examination may include 5 subsections which use different formats and texts to test different subskills. Specific formats or a range of formats are usually given in specifications, in addition to the number of questions for each format or section. Timing is another condition which specifications state. The time may be given for the entire test or sometimes for each individual subsection. For example, you can place time-dependent skills such as skimming and scanning in separately timed sections, or you can place them at the end of a longer reading test where students typically are reading faster to finish within the allocated time.

Grading criteria
Specifications indicate how the assessment instrument will be marked. For instance, the relative importance of marks for communication as contrasted to those for mechanics (spelling, punctuation, capitalization) should reflect the overall approach and objectives of the instructional program. Similarly, if some skills are deemed more important or require more processing than other skills, they may be weighted more heavily.

In short, specifications help teachers and administrators establish a clear linkage between the overall objectives for the program and the design of particular assessment instruments. Specifications are especially useful for ensuring even coverage of the main skills and content of courses as well as for developing tests that are comparable to one another because they are based on the same guidelines.


Recommendations for Reading Assessment


Texts
Texts can be purpose-written, taken directly from authentic material, or adapted. The best way to develop good reading assessments is to be constantly on the watch for appropriate material. Keep a file of authentic material from newspapers, magazines, brochures, instruction guides, or anything else that is a suitable source of real texts. Other ways to find material on particular topics are to use an encyclopedia written at an appropriate readability level or to use an Internet search engine. Whatever the source, cite it properly.

Microsoft Word provides word counts and readability statistics. First, highlight the passage, and then select Word Count from the Tools menu. To access readability information, go to Options under the Tools menu, then Spelling and Grammar, and tick Show Readability Statistics. Readability is based on word and sentence length as well as the use of the passive voice. You can raise or lower the level by changing these. You can also add line numbers and other special features to texts.

Questions
Make sure that questions are written at a slightly lower level than the reading passages. Reading questions should be in the same order as the material in the passage itself. If you have two types of questions or two formats based on one text, go through the text with different colored markers to check that you have evenly covered the material in order. For objective formats such as multiple choice and true/false/not given, try to make all statements positive. If you phrase a statement negatively and an option is negative as well, students have to deal with the logical problems of double negatives. Whenever possible, rephrase material using synonyms to avoid students scanning for verbatim matches. Paraphrasing encourages vocabulary growth as positive washback.
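Word's readability statistics are based on formulas such as Flesch Reading Ease, which combines average sentence length and average syllables per word. The Python sketch below approximates that calculation so you can see how the numbers are derived; the syllable counter is a rough heuristic, so the result will not exactly match Microsoft Word's figure.

# Illustrative sketch: approximate Flesch Reading Ease for a passage.
# The syllable count uses a rough vowel-group heuristic, so results will
# differ slightly from Microsoft Word's readability statistics.
import re

def count_syllables(word):
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / len(sentences)   # average sentence length
    asw = syllables / len(words)        # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw

passage = ("Reading tests should use texts at the right level. "
           "Shorter sentences and common words usually make a text easier.")
print(f"Flesch Reading Ease: {flesch_reading_ease(passage):.1f}")

Higher scores indicate easier texts; shortening sentences or replacing long words raises the score, which is exactly the kind of adjustment described above.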


Assessing Listening
The assessment of listening abilities is one of the least understood and least developed, and yet one of the most important, areas of language testing and assessment (Alderson & Bachman, 2001). In fact, Nunan (2002) calls listening comprehension the "poor cousin" amongst the various language skills because it is the most neglected skill area. As teachers we recognize the importance of teaching and then assessing the listening skills of our students, but - for a number of reasons - we are often unable to do this effectively. One reason for this neglect is the lack of culturally appropriate listening materials suitable for EF/SL contexts. The biggest challenges for teaching and assessing listening comprehension center on the production of listening materials. Indeed, listening comprehension is often avoided because of the time, effort and expense required to develop, rehearse, record and produce high-quality audio tapes or CDs.

Approaches to Listening Assessment


Buck (2001) has identified three major approaches to the assessment of listening abilities: the discrete-point, integrative and communicative approaches. The discrete-point approach became popular during the early 1960s with the advent of the Audiolingual Method. This approach identified and isolated listening into separate elements. Some of the question types utilized in this approach included phonemic discrimination, paraphrase recognition and response evaluation. An example of phonemic discrimination is assessing students by their ability to distinguish minimal pairs like ship/sheep. Paraphrase recognition is a format that requires students to listen to a statement and then select the option closest in meaning to the statement. Response evaluation is an objective format that presents students with questions and then four response options. The underlying rationale for the discrete-point approach stemmed from two beliefs: first, that it was important to be able to isolate one element of language from a continuous stream of speech; and secondly, that spoken language is the same as written language, only presented orally.

The integrative approach, beginning in the early 1970s, called for integrative testing. The underlying rationale for this approach is best explained by Oller (1979, p. 37), who stated that "whereas discrete items attempt to test knowledge of language one bit at a time, integrative tests attempt to assess a learner's capacity to use many bits at the same time." Proponents of the integrative approach to listening assessment believed that the whole of language is greater than the sum of its parts. Common question types in this approach were dictation and cloze.

The third approach, the communicative approach, arose at approximately the same time as the integrative approach as a result of the Communicative Language Teaching movement. In this approach, the listener must be able to comprehend the message and then use it in context. Communicative question formats must be authentic in nature.

Issues in Listening Assessment


A number of issues make the assessment of listening different from the assessment of other skills. Buck (2001) has identified several issues that need to be taken into account: setting, rubric, input, voiceovers, test structure, formats, timing, scoring and finding texts. Each is briefly described below and recommendations are offered.

Setting
The physical characteristics of the test setting or venue can affect the validity and/or reliability of the test. Exam rooms must have good acoustics and minimal background noise. Equipment used in test administrations should be well maintained and checked beforehand. In addition, an AV technician should be available for any potential problems during the administration.

Rubric
Context is extremely important in the assessment of listening comprehension because test takers don't have access to the text as they do in reading. Context can be written into the rubric, which enhances the authenticity of the task. Instructions to students should be in the students' L1 whenever possible. However, in many teaching situations, L1 instructions are not allowed. When L2 instructions are used, they should be written at one level of difficulty lower than the actual test. Clear examples should be provided for students, and point values for questions should be included in the rubrics.

Input
Input should have a communicative purpose. In other words, the listener must have a reason for listening. Background or prior knowledge needs to be taken into account. There is a considerable body of research that suggests that background knowledge affects comprehension and test performance. In a testing situation, we must take care to ensure that students are not able to answer questions based on their background knowledge rather than on their comprehension.

Voiceovers
Anyone recording a segment for a listening test should receive training and practice beforehand. In large-scale testing, it is advisable to use a mixture of genders, accents and dialects. To be fair to all students, listening voiceovers should match the demographics of the teacher population. Other issues are the use of non-native speakers for voiceovers and the speed of delivery. Our belief is that non-native speakers of English constitute the majority of English-speaking people in the world. Whoever is used for listening test voiceovers, whether native or non-native speakers, should speak clearly and enunciate carefully. The speed of delivery of a listening test should be consistent with the level of the students and the materials used for instruction. If your institution espouses a communicative approach, then the speed of delivery for listening assessments should be native or near-native. The delivery of the test should be standard for all test takers. If live readers are used, they should practice reading the script before the test and standardize with other readers.

Test Structure
The way a test is structured depends largely on who constructs it. There are generally two schools of thought on this: the British and the American perspectives. British exam boards generally grade input from easy to difficult across a test and mix formats within a section. This means that the easier sections come first with the more difficult sections later. American exam boards, on the other hand, usually grade question difficulty within each section of an exam and follow the 30/40/30 rule. This rule states that 30% of the questions within a test or test section are of an easy level of difficulty, 40% of the questions represent mid-range levels of difficulty, and the remaining 30% of the questions are of an advanced level of difficulty. American exam boards usually use one format within each section. The structure you use should be consistent with the external benchmarks used in your program. It is advisable to start the test with an easy question. This will lower students' test anxiety by relaxing them at the outset of the test. Within a listening test, it is important to test as wide a range of skills as possible. Questions should be ordered as they are heard in the passage and should be well spaced out in the passage for good content coverage. It is recommended that no content from the first 15-20 seconds of the recording be tested, to allow students to adjust to the listening. Many teachers only include test content which is easy to test, such as dates and numbers. Include some paraphrased content to challenge students.

Formats
Perhaps the most important piece of advice here is that students should never be exposed to a new format for the first time in a testing situation. If new formats are to be used, they should first be practiced in a teaching situation and then introduced into the testing repertoire. Objective formats like MCQs and T/F are often used because they are more reliable and easier to mark and analyze. When using these formats for listening, make sure that the N option is dropped from T/F/N and that three response options instead of four are used for MCQs. Remember that with listening comprehension, memory plays a role. Since students don't have repeated access to the text, more options add to the memory load and affect the difficulty of the task and question. Visuals are often used as part of listening comprehension assessment. When using them as input, make certain that you use clear copies that reproduce well.


Skill contamination is an issue that is regularly discussed with regard to listening comprehension. Skill contamination is the idea that a test taker must use other language skills in order to answer questions on a listening test; for example, a test taker must first read the question and then write the answer. Whereas skill contamination used to be viewed negatively in the testing literature, it is now viewed more positively and termed skill integration.

Timing
The length of a listening test is generally determined by one of two things: the length of the tape or the number of repetitions of the passages. Most published listening tests do not require the proctor to attend to timing; he or she simply inserts the tape or CD into the machine, and the test is over when the proctor hears a pre-recorded "This is the end of the listening test" statement. For teacher-produced listening tests, the timing is usually determined by how many times the test takers are permitted to hear each passage. Proficiency tests like the TOEFL usually play the input only once, whereas achievement tests usually play it twice. Buck (2001) recommends that input be heard once if you're assessing main ideas and twice if you're assessing detail. According to Carroll (1972), listening tests should not exceed 30 minutes. Remember to give students time to pre-read the questions before the recording starts and to answer the questions as the test proceeds. If students are required to transfer their answers from the test paper to an answer sheet, extra time for this should be built into the exam.

Scoring
The scoring of listening tests presents numerous challenges to the teacher/tester. Dichotomous scoring (questions that are either right or wrong) is easier and more reliable. However, it doesn't lend itself to many of the communicative formats such as note-taking. Other issues are whether points are deducted for grammar or spelling mistakes or for non-adherence to word counts. When more than one teacher is involved in marking a listening test, calibration or standardization training should be completed to ensure fairness to all students.

Finding Suitable Texts
Many teachers feel that the unavailability of suitable texts is listening comprehension's most pressing issue. The reason is that creating scripts with the characteristics of oral language is not an easy task. Some teachers simply take a reading text and transform it into a listening script. This results in contrived and inauthentic listening tasks because written texts often lack the redundant features that are so important in helping us understand speech. A better strategy is to look for texts that concentrate on characteristics unique to listening. If you start collecting texts that have the right oral features, you can then construct tasks around them. When graphics or visuals are used as test context, teachers often find themselves driven by clip art. This occurs when teachers build a listening script around readily available graphics. It is best to inventory the topics in a course and collect appropriate material well in advance of exam construction. To produce more spontaneous-sounding listening recordings, use programs already on your computer, such as Sound Recorder, or shareware such as Audacity or PureVoice, to record scripts for classroom listening assessments.

Vocabulary
Research suggests that students must know 90-95% of the words in a text or script in order to understand it. The level of the vocabulary you use in your scripts can therefore affect difficulty and hence students' comprehension. If your institution employs word lists, seed vocabulary from those lists into listening scripts whenever possible. To determine the vocabulary profile of your text/script, go to http://www.er.uqam.ca/nobel/r21270/cgi-bin/webfreqs/web_vp.cgi for Vocabulary Profiler, a very user-friendly piece of software. By simply pasting your text into the program, you will receive information about the percentage of words that come from Nation's 1000 Word List and the Academic Word List. Another thing to remember about vocabulary is that lexical overlap can affect difficulty. Lexical overlap occurs when words used in the passage also appear in the questions and response options. When words from the passage appear in the correct answer or key, the question is easier; when the overlap is with the distractors, the question becomes more difficult. A final thought on vocabulary is that unknown vocabulary should never occur as the keyable response (the actual answer) in a listening test.
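For teachers who want a quick offline check of lexical coverage before turning to an online tool like Vocabulary Profiler, the short Python sketch below estimates what percentage of the running words in a script appear in a word list. It is only an illustration: the file names (listening_script.txt, wordlist_1000.txt) and the 90% warning threshold are assumptions for the example, not part of any of the tools mentioned above.

```python
import re

def lexical_coverage(script_path, wordlist_path):
    """Return the percentage of running words in a script that appear in a word list."""
    # Load the word list (assumed format: one word per line); hypothetical file name.
    with open(wordlist_path, encoding="utf-8") as f:
        known_words = {line.strip().lower() for line in f if line.strip()}

    # Tokenize the listening script into lower-case word forms.
    with open(script_path, encoding="utf-8") as f:
        tokens = re.findall(r"[a-z']+", f.read().lower())

    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if token in known_words)
    return 100 * covered / len(tokens)

if __name__ == "__main__":
    coverage = lexical_coverage("listening_script.txt", "wordlist_1000.txt")
    print(f"Coverage: {coverage:.1f}%")
    # Research cited above suggests learners should know roughly 90-95% of the words.
    if coverage < 90:
        print("Warning: this script may be lexically too difficult for the target level.")
```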

Final Recommendations for Listening Assessment


No matter what the skill area, test developers should always be guided by the cornerstones of good testing practice when constructing tests:
Validity (Does it measure what it says it does?)
Reliability (Are the results consistent?)
Practicality (Is the test teacher-friendly?)
Washback (Is feedback channeled to everyone concerned?)
Authenticity (Do the tasks mirror real-life contexts?)
Transparency (Are expectations clear to students? Do students and teachers have access to information about the test/assessment?)
Security (Are exams and item banks secure? Can they be reused?)


Assessing Speaking
Always keeping the cornerstones of good assessment in mind, why do we want to test speaking? In a general English program, speaking is an important channel of communication in daily life. We want to simulate real-life situations in which students engage in conversation, ask and answer questions, and give information. In an academic English program, the emphasis might be on participating in class discussions and debates or giving academic presentations. In a Business English course, students might develop telephone skills, interact in a number of common situations involving meetings, travel, and sales, as well as make reports. Whatever the teaching focus, valid assessment should reflect the course objectives and the eventual target language.

Speaking is a productive language skill like writing and thus shares many issues, such as whether to grade holistically or analytically. However, unlike writing, speaking is ephemeral unless measures are taken to record student performance. Yet the presence of recording equipment can inhibit students, and often recording is not practical or feasible. To score reliably, it is often necessary to have two teachers assess together. When this happens, one is the interlocutor who interacts with the speaker(s) while the other teacher, the assessor, tracks the student's performance.

Based on Bygate's categories, Weir (1993) divides oral skills into two main groups: speaking skills that are part of a repertoire of routines for exchanging information or interacting, and improvisational skills such as negotiating meaning and managing the interaction. The routine skills are largely associated with language functions and the spoken language required in certain situations. By contrast, the improvisational skills are more general and may be brought into play at any time for clarification, to keep a conversation flowing, to change topics or to take turns. In circumstances where presentation skills form an important component of a program, naturally they should be assessed. However, avoid situations where a student simply memorizes a prepared speech. Decide which speaking skills are most germane to a particular program and then create assessment tasks that sample skills widely with a variety of tasks.

While it is possible to assess speaking skills on an individual basis, most large exam boards opt to test pairs of students with pairs of testers. Within tests organized in this way, there are times when only one student speaks and other times when the students interact in a conversation. This setup makes it possible to test common routine functions as well as a range of improvisational skills. For reliability, interlocutors should work from a script so that all students get similar questions framed in the same way. In general, the teacher or interlocutor should keep in the background and only intercede if truly necessary.


Common speaking assessment formats
It is good practice to start the speaking assessment with a simple task that puts students at ease so they perform better. Often this takes the form of asking the students for some personal information.
Interview: can be teacher to student or student to student. Teacher to student is more reliable when the questions are scripted.
Description of a photograph or item: Students describe what they see.
Narration: This is often an elaboration of a description. The student is given a series of pictures or a cartoon strip showing the major events in a story.
Information gap activity: One student has information the other lacks and vice versa. Students have to exchange information to see how it fits together.
Negotiation task: Students work together on a task about which they may have different opinions. They have to reach a conclusion in a limited period of time.
Roleplays: Students are given cue cards with information about their character and the setting. Some students find it difficult to project themselves into an imaginary situation, and this lack of acting ability may affect reliability.
Oral presentations: Strive to make them impromptu rather than rehearsed.

Recommendations for Speaking Assessment


Decide with your colleagues which speaking subskills are most important and adopt a grading scale that fits your program.
Whether you adopt a holistic or analytical approach to grading, create a recording form that enables you to track students' production and later give feedback for improvement. Think about these factors: fluency vs. accuracy, appropriate responses (indicating comprehension), pronunciation, accent and intonation, and use of repair strategies.
Train teachers in scoring and practice together until there is a high rate of inter-rater reliability (one simple way to check this is sketched below). Use moderation sessions with high-stakes exams.
Keep skill contamination in mind. Don't give students lengthy written instructions which must be read and understood before speaking.
Remember that larger samples of language are more reliable. Make sure that students speak long enough on a variety of tasks.
Choose tasks that generate positive washback for teaching and learning!
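One low-tech way to monitor inter-rater reliability during calibration or moderation sessions is simply to compare two raters' scores on the same set of performances. The Python sketch below is a minimal illustration, not a full reliability analysis: the band scores are invented, and exact agreement plus a Pearson correlation (statistics.correlation requires Python 3.10 or later) stand in for more sophisticated indices.

```python
from statistics import correlation  # available in Python 3.10+

# Hypothetical band scores (1-5) given by two raters to the same eight students.
rater_a = [4, 3, 5, 2, 4, 3, 5, 4]
rater_b = [4, 3, 4, 2, 4, 2, 5, 4]

# Proportion of students on whom the two raters agree exactly.
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Pearson correlation between the two sets of scores.
r = correlation(rater_a, rater_b)

print(f"Exact agreement: {exact_agreement:.0%}")
print(f"Correlation between raters: {r:.2f}")
```

If agreement is low, further standardization training or a moderation discussion of the discrepant performances is usually warranted.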


Student Test-Taking Strategies


In today's universities, grades are substantially determined by test results. So much importance is placed on students' test results that often just the word "test" makes students afraid. The best way for students to overcome this feeling of fear or nervousness is to prepare themselves with test-taking strategies. This process should begin during the first week of each semester and continue throughout the school year. The key to successful test taking lies in a student's ability to use time wisely and to develop practical study habits. In fact, effective test-taking strategies are synonymous with effective learning strategies. This section is intended to provide suggestions for long-term successful learning techniques and test-taking strategies, not quick "tricks"; there is nothing that can replace the development of good study skills.

The following steps will help students approach tests with confidence:
Make a semester study plan.
Come to class regularly.
Use good review techniques.
Organize pre-exam hours wisely.
Plan out how to take the exam.
Use strategies appropriate to the skill area.
Learn from each exam experience.

Make A Semester Study Plan


Students need to plan their study time for each week of their courses. They should make schedules for themselves and revise these schedules when necessary. These schedules should:

BE REALISTIC. Keep a balance between classes and studying. Block out space for study time, class time, family time and recreation time.


INCLUDE A STUDY PLACE. Finding a good place to study will help students get started; don't forget to have all the materials needed (e.g. pens, paper, textbooks, highlighter pens, etc.).

INCLUDE A DAILY STUDY TIME. Students forget things almost at once after learning them, so they should immediately review materials learned in class. Students should go over the main points from each class and/or textbook for a few minutes each night. Encourage students to do homework assignments during this time as a good way to remember important points made in class.

Come To Class Regularly


In order for language learning to take place, students need to come to class on a regular basis. It is not surprising to note that poor attendance correlates highly with poor test results. Teachers need to point out early in the semester what constitutes legitimate reasons to be absent and stress the advantages of regular attendance.

Use Good Review Techniques


If students make a semester study plan and follow it, preparing for exams should really be a matter of reviewing materials. Research shows that the time spent reviewing should be no more than 15 minutes for weekly quizzes, 2 to 3 hours for a midterm exam, and 5 to 8 hours for a final exam. When reviewing for a test, students should do the following:

PLAN REVIEW SESSIONS. Look at the course outline, notes and textbooks. What are the major topics? Make a list of them. How much time was spent on each of these topics in class? Did the teacher note that some topics were more important than others? If so, these should be emphasized in review sessions.

TAKE THE PRACTICE EXAM. By taking the practice exam, students will have an idea of the tasks/activities that they will encounter on the real exam. They will also know the point allocation for each section. This information can help them plan their time wisely.

REVIEW WITH FRIENDS. Another way of studying for an exam is to create a "study group". By studying with friends there is the advantage of sharing information with others who are reviewing the same material. A study group, however, should not take the place of studying individually.

Organize Pre-Exam Hours Wisely



Students who have regularly reviewed course materials throughout the semester don't have to "cram" at the last minute. They can concentrate their efforts on particular areas of difficulty and conduct an overall review of the material to be tested. Physical and mental fitness are important considerations for good test taking. These can best be achieved by making sure that the student has adequate rest and nutrition in the hours preceding the exam. A well-rested and well-fed student who has prepared thoroughly is likely to be calm and self-confident, two other important factors for successful test taking. Some teachers have found it useful to encourage students to engage in stress-reducing activities.

Strategize Your Exam Plan


An important factor in test taking is exam planning. Students should arrive early at the designated exam room and find a seat. All books and personal effects (with the exception of student ID cards and writing materials) should be left at the front of the room. Students should come prepared with several pens or pencils and an eraser. As soon as the exams have been distributed and students have been told to start, each student should write his/her name and ID number on all pages of the test paper. If one section is given first, such as the listening portion of English exams, the student should focus attention on this section. With any section of the exam, the student is well advised to do an overview of the questions, their values, and the tasks required. At this point, students should determine whether the exam must be done in order (i.e. listening first) or whether they can skip around between sections. The latter is not possible on some standardized exams where students must complete one section before moving on to the next.

An important consideration in effective test taking is time management. When exams are written, review time is usually factored into the overall exam design. Students should be encouraged to allocate their time in proportion to the value of each exam section and to allow time to review their work. When proctoring, teachers can assist students with time management by alerting them to the time remaining in the exam. Computer-based tests (such as the new TOEFL) often show a countdown of the remaining time; students should be made aware of this feature during practice exams.

A recent research project investigating the reading skills of English students has yielded several disturbing findings. First, students frequently fail to read directions or read them only superficially. Teachers can acquaint students with the requisite metalanguage of rubrics and encourage them to focus on the important points of the task. For example, teachers should point out that reading for main ideas requires very different strategies than scanning for specific information. Brainstorm with your students on the key terms found in rubrics. Another finding of this project is that students don't spend enough time on the reading and don't refer back to the text as often as they perhaps should. Again, when students are reading for specific detail, it is important that they refer back to the main text for each question.

Use Strategies Appropriate To The Skill Area


Teachers should train students in effective strategies for the various skill areas to be tested. Important activities (e.g. note-taking for listening and writing tasks) should be demonstrated to students during classroom activities. Representative strategies for English skill areas will be modeled in today's workshop.

Learn From Each Exam Experience.


Each test should be a learning experience. Go over test results with students. Teachers should note specific students' strengths and weaknesses. The analyses that teachers receive right after computer-based exams provide invaluable information in a timely manner. Teachers should use this information to refer students to student support services for remediation. Each exam that students take should help them do better on the next one.


Statistics
Statistics are simply mathematical summaries of exam results. Unfortunately, the term statistics often has negative connotations for language teachers. Yet teachers can easily learn how to use statistics to get information about their students' performance on a test and to check the test's reliability. Basic statistics provide information on individual students, the class as a whole, the course content and how it has been taught. Every teacher can benefit from this feedback. The most important statistics are simple arithmetic concepts which are easy to compute with a hand calculator.

Basic Statistics for Classroom Testing


The most useful statistics for classroom teachers are known as descriptive statistics. They "describe" the population of students taking the test. The mean, mode, median, standard deviation, and range are common descriptive statistics. Of these, the mean is the most important for classroom teachers. Other descriptive statistics are important for large-scale, high-stakes testing and can easily be obtained with computer applications such as Excel. See the annotated bibliography for suggestions on testing books that cover the use of other statistics.

The Mean: Once a test has been administered to a group of students, the first step for any classroom teacher should be to compute the mean score or arithmetic average. The mean is the sum of all the scores divided by the number of scores. Mean scores can be computed for the test as a whole or for each section (i.e. listening, reading, writing, etc.) of a test. Computing a mean score can also give you information about the reliability of the test: on a typical classroom test marked out of 100, mean scores that fall in the 70s (i.e. from 70 to 79) are generally taken as a sign that the test is functioning well. For shorter or mastery quizzes, however, teachers can expect higher means.

Pass/Fail Rate: Another useful statistic to compute is the pass or failure rate for a given test or quiz. This is most simply done with a grade breakdown. The first step is to count the number of A's, B's, C's, and D's received on the test; this is the number of students who passed. Divide this number by the total number of students who took the test and you have the pass rate. To compute the failure rate, count the number of F's or failing grades and divide this number by the total number of students who took the exam.
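For teachers who prefer to let the computer do the arithmetic, the short Python sketch below computes the mean and the pass/fail rates described above. The scores and the pass mark of 60 are illustrative assumptions; substitute your own grading scale.

```python
def descriptive_summary(scores, pass_mark=60):
    """Compute the mean and the pass/fail rates for a list of test scores."""
    n = len(scores)
    mean = sum(scores) / n                              # sum of scores / number of scores
    passed = sum(1 for s in scores if s >= pass_mark)   # count of passing students
    failed = n - passed
    return {
        "mean": round(mean, 1),
        "pass_rate": round(100 * passed / n, 1),
        "fail_rate": round(100 * failed / n, 1),
    }

# Example: a class of ten students (invented scores)
scores = [45, 58, 62, 67, 71, 74, 78, 83, 88, 94]
print(descriptive_summary(scores))
# {'mean': 72.0, 'pass_rate': 80.0, 'fail_rate': 20.0}
```

The same figures can, of course, be produced with a hand calculator or a spreadsheet.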

Histograms: Histograms are visual representations of how well a group of students did on a test or quiz. They can easily be drawn from a grade breakdown (the number of A, B, C, D and F grades received on a test). These totals are then plotted on a graph, and the resulting shape shows how the class did as a whole on the test.

Computing Basic Statistics for Classroom Use


Figuring the mean
1. Add the grades of all students.
2. Divide the total of the grades by the number of students.
3. The result is the mean for that test or quiz.

Figuring the pass rate
1. Count the number of students in each grade category. In some systems, this will be A, B, C, D, F. Note that test and quiz grades can be out of any number, not just 100.
2. Divide the number of students who received a grade in all passing categories by the number of students who took the test.
3. The result is the pass rate for that class for that test.

Figuring the failure rate
1. Count the number of students in each grade category. In some systems, this will be A, B, C, D, F.
2. Divide the number of students who received a grade in all failing categories by the number of students who took the test.
3. The result is the failure rate for that class for that test.

Plotting a histogram
1. A histogram is a picture of your grade distribution or breakdown. It is a simple graph with two axes. The vertical axis represents the number of students who took the exam; the horizontal axis shows the range of grades received on the exam.
2. Create the vertical axis by showing how many students took the exam. For example, if you have 25 students in your class, have the bottom represent 0 and the top 25, with intervals of 5 students.
3. Create the horizontal axis by showing the range of possible grades. For example, you may have F on the left side, followed by D, C, B, and A in ascending order.


4. Plot the number of students who received each grade. Remember to note grades for which there were no scores at the zero level. Then, you can either use bars to depict the number of scores in each category or connect the dots at the top of each column (include the zeroes!).
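If matplotlib is available, the grade breakdown can also be plotted in a few lines of Python. This is a minimal sketch of the steps above: the scores are invented, and the letter-grade bands follow the sample worksheet that comes next (A 90-100, B 80-89, C 70-79, D 60-69, F 59 and below).

```python
from collections import Counter
import matplotlib.pyplot as plt

def grade(score):
    """Map a numeric score to a letter grade using the bands from the sample worksheet."""
    if score >= 90: return "A"
    if score >= 80: return "B"
    if score >= 70: return "C"
    if score >= 60: return "D"
    return "F"

scores = [45, 58, 62, 67, 71, 74, 78, 83, 88, 94]      # invented class results
counts = Counter(grade(s) for s in scores)

grade_order = ["F", "D", "C", "B", "A"]
heights = [counts.get(g, 0) for g in grade_order]      # include zeroes for empty categories

plt.bar(grade_order, heights)
plt.xlabel("Grade")
plt.ylabel("Number of students")
plt.title("Grade distribution")
plt.show()
```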



Sample Classroom Statistics Worksheet


Class:________________ Date:_________________

Teacher:___________________________ Exam:_____________________________

Failing F 59 and below

D grades 60 to 69

C grades 70 to 79

B grades 80 to 89

A grades 90 to 100

Total number of students taking exam: _________

Number passed: _________   Pass rate: _____%
Number failed: _________   Fail rate: _____%

Grade breakdown:
A: n = ____  % = ____
B: n = ____  % = ____
C: n = ____  % = ____
D: n = ____  % = ____  (cusp: n = ____  % = ____)
F: n = ____  % = ____

Class Mean:
Total of all scores: _______  divided by  number of Ss: _______  =  mean: __________

[Blank histogram grid for plotting the grade distribution: vertical axis 0 to 25 students in intervals of 5; horizontal axis F, D, C, B, A]

Technology for Testing


Technology is increasingly employed in testing. The last decade has seen a progression from scanned examinations to the widespread use of computer adaptive testing. Each institution embraces technology according to available resources, but whatever the use, technology should always be in the service of teaching and testing, not the other way around.

Scanned Tests
Many schools and colleges use scanners to quickly and accurately grade and analyze objective tests. From a testing perspective, the most important consideration is that items and tasks follow good testing practice, since the results will only be as good and valid as the test itself. The use of a scanner requires that students fill in a special answer or "bubble" sheet that is machine-readable; these sheets are read by an OMR, or Optical Mark Reader. Some students may inadvertently transfer information incorrectly from the question paper to the answer sheet, so students should receive training before scan sheets are used for high-stakes examinations.

Computer-Based Testing
Computer-based testing (CBT) has numerous advantages, including:
quick, accurate results and feedback
detailed statistical analysis
easy administration with a high level of security
item banks of validated test items
encouragement of certain effective test-taking skills
However, the use of CBTs requires access to hardware and software as well as special training in test construction using formats that are amenable to machine scoring. Additionally, some skills such as writing and speaking can only be computer tested with sophisticated equipment.

Computer Adaptive Testing
Recently, institutions such as ETS have developed language tests using computer adaptive testing (CAT). CATs are tailored to individual ability levels since each question presented to the candidate is based on his or her response to the previous question. In addition to all the CBT features noted above, advantages include shorter testing time and the establishment of the individual's unique level. CAT requires sophisticated equipment and test-writing skills based on item response theory.
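To make the idea of adaptation concrete, the deliberately simplified Python sketch below moves item difficulty up after a correct answer and down after an incorrect one. Real CAT engines, such as those developed by ETS, estimate ability with item response theory rather than simple difficulty stepping; the item bank structure, the ask() placeholder and the five-level scale here are illustrative assumptions only.

```python
import random

# Hypothetical item bank: items grouped by difficulty level 1 (easiest) to 5 (hardest).
item_bank = {level: [f"item-{level}-{i}" for i in range(1, 11)] for level in range(1, 6)}

def ask(item):
    """Placeholder for presenting an item and marking the response.

    A real system would display the question and score the candidate's answer;
    here we simulate a candidate who answers items at or below level 3 correctly.
    """
    level = int(item.split("-")[1])
    return level <= 3

def simple_adaptive_test(num_items=10, start_level=3):
    level = start_level
    responses = []
    for _ in range(num_items):
        item = random.choice(item_bank[level])
        correct = ask(item)
        responses.append((item, correct))
        # Move up one level after a correct answer, down one level after an incorrect one.
        level = min(5, level + 1) if correct else max(1, level - 1)
    return level, responses

final_level, log = simple_adaptive_test()
print("Estimated level after 10 items:", final_level)
```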


Internet Resources
The Internet provides access to global resources on testing. Some Internet sites address generic testing issues while others are specific to English language testing. Most of the sites listed below point to other Internet testing resources. Please note that URLs or addresses change often; those provided were accurate as of mid-March 2004.

Language testers consider Glenn Fulcher's Resources in Language Testing site (http://www.dundee.ac.uk/languagestudies/ltest/ltr.html) the premier Internet resource. Dr. Fulcher maintains this site, which directs the searcher to virtually all important general and language assessment sites on the Internet. Information about language testing conferences as well as a range of articles on different aspects of assessment can be found here. In addition, Dr. Fulcher also maintains the International Language Testing Association's (ILTA) website and provides links to the testing listserv LTEST-L.

The Clearinghouse on Assessment and Evaluation site (http://www.ericae.net), provided by ERIC, the Educational Resources Information Center, offers links to a wide range of resources including the ERIC Search Wizard. Useful aspects of the ERIC site include the online ERIC Thesaurus, which facilitates finding keywords for searches, and RIE, Resources in Education. It is also the access point for an excellent peer-reviewed electronic journal entitled Practical Assessment, Research and Evaluation.

CRESST, the National Center for Research on Evaluation, Standards and Student Testing (http://cresst96.cse.ucla.edu/index.htm), is funded by the U.S. Department of Education. It is primarily useful for K-12 teachers.

Information about North American high-stakes testing is found at the site maintained by Educational Testing Service (http://www.ets.org/), the developers of TOEFL, GRE and other standardized tests. These sites help teachers determine which tests are most appropriate for their students and also provide specific information on test specifications for preparing students to sit the exams. Look here for information on the next generation TOEFL exam to be launched in 2005.

Language Testing Update from the University of Lancaster provides online summaries of recent publications and news of professional conferences at its site (http://www.ling.lancs.ac.uk/pubs/ltu/ltumain.htm).

Teachers who want to become more familiar with Computer Adaptive Testing can learn more about it at the University of Minnesota's CARLA website, which employs a FAQ (frequently asked questions) format (http://carla.acad.umn.edu/CAT.html).


Alternative Assessment
Traditional vs. Alternative Assessment
One useful way of understanding the concept of alternative assessment is to contrast it with traditional testing. Alternative assessment differs from traditional assessment in that it actually asks students to show what they can do; students are evaluated on what they integrate and produce rather than on what they are able to recall and reproduce (Huerta-Macias, 1994). Tests are a means of determining whether students have learned what they have been taught. Sometimes tests serve as feedback to students on their overall progress in a language course. By contrast, alternative assessment provides alternatives to traditional testing in that it:
does not intrude on regular classroom activities
reflects the curriculum that is actually being implemented in the classroom
provides information on the strengths and weaknesses of each individual student
provides multiple indices that can be used to gauge student progress
is more multiculturally sensitive and free of the norm, linguistic, and cultural biases found in traditional testing (Huerta-Macias, 1994)
Bailey (1998, p. 207) provides a very useful chart that effectively contrasts traditional and alternative assessment (see Figure 1).

Figure 1: Contrasting Traditional and Alternative Assessment


Traditional Assessment | Alternative Assessment
One-shot tests | Continuous, longitudinal assessment
Indirect tests | Direct tests
Inauthentic tests | Authentic assessment
Individual projects | Group projects
No feedback provided to learners | Feedback provided to learners
Timed exams | Untimed exams
Decontextualized test tasks | Contextualized test tasks
Norm-referenced score interpretation | Criterion-referenced score interpretation
Standardized tests | Classroom-based tests

Types of Alternative Assessment

Self-assessment


Self-assessment plays a central role in student monitoring of progress in a language program. It refers to the student's evaluation of his or her own performance at various points in a course. An advantage of self-assessment is that student awareness of outcomes and progress is enhanced.

Portfolio Assessment
Portfolios are collections, assembled by both teacher and student, of representative samples of ongoing work over a period of time. The best portfolios are more than a scrapbook or a folder of "all my papers"; they contain a variety of work in various stages and utilize multiple media.

Student-designed Tests
A novel approach within alternative assessment is to have students write tests on course material. This process results in greater learner awareness of course content, test formats, and test strategies. Student-designed tests are good practice and review activities that encourage students to take responsibility for their own learning.

Learner-centered Assessment
Learner-centered assessment advocates using input from learners in many areas of testing. For example, students can select the themes, formats and marking schemes to be used. Involving learners in aspects of classroom testing results in reduced test anxiety and greater student motivation.

Projects
Typically, projects are content-based and involve a group of students working together to find information about a topic. In the process, they use authentic information sources and have to evaluate what they find. Projects usually culminate in a final product in which the information is given to other people. This product could be a presentation, a poster, a brochure, a display or many other options. An additional advantage of projects is that they integrate language skills in a real-life context.


Presentations
Presentations can be an assessment tool in themselves for speaking, but more often they are integrated into other forms of alternative assessment. Increasingly, students make use of computer presentation software, which helps them to clarify the organization and sequence of their presentation. Presentations are another real-life skill that gives learners an opportunity to address some of the socio-cultural aspects of communication, such as using appropriate register and discourse.

Future Directions
Because language performance depends heavily on the purpose for language use and the context in which it occurs, it makes sense to provide students with assessment opportunities that reflect these practices. In addition, we as language testers must be responsive to the differing learning styles of students. In the real world, we must demonstrate that we can complete tasks using the English language effectively both at work and in social settings. Our assessment practices must reflect the importance of using language both in and outside of the language classroom.


Testing Acronyms: How to Sound Like an Expert


ACTFL: American Council for the Teaching of Foreign Languages
ALTE: Association of Language Testers in Europe
ASLPR: Australian Second Language Proficiency Ratings
CAE: Certificate in Advanced English
CAT: Computer-Adaptive Testing
CALT: Computer-Assisted Language Testing
CBT: Computer-Based Testing
CCSE: Certificates in Communicative Skills in English
CPE: Certificate of Proficiency in English
CRT: Criterion-Referenced Testing
EPT: English Placement Test (UMich)
ETS: Educational Testing Service
FCE: First Certificate in English
FSI Scales: Foreign Service Institute
IELTS: International English Language Testing System
ILTA: International Language Testing Association
IRT: Item Response Theory
KET: Key English Test
LAB: Language Aptitude Battery (Pimsleur)
LTRC: Language Testing Research Colloquium
LTU: Language Testing Update
MCQ: Multiple-Choice Question
MELAB: Michigan English Language Assessment Battery
MLAT: Modern Language Aptitude Test
MTELP: Michigan Test of English Language Proficiency
NRT: Norm-Referenced Testing
OPI: Oral Proficiency Interview (best known ACTFL test)
PET: Preliminary English Test
SEM: Standard Error of Measurement
T/F/N: True/False/Not Given question
TOEFL: Test of English as a Foreign Language
TOEIC: Test of English for International Communication
TWE: Test of Written English
UCLES: University of Cambridge Local Examinations Syndicate
YLE: Cambridge Young Learners English Test


Glossary Of Important Testing Terms


Achievement test: measures what a learner knows from what he/she has been taught; this type of test is typically given by the teacher at a particular time during a course and covers a certain amount of material.
Alignment: the process of linking content and performance standards to assessment, instruction, and learning in classrooms.
Alternative assessment: refers to a non-conventional way of evaluating what students know and can do with the language; it is informal and usually administered in class; examples of this type of assessment include self-assessment and portfolio assessment.
Alternate forms: different editions of the same assessment written to meet common specifications and comparable in most respects, except that some or all of the questions differ in content.
Analytical scale: a type of rating scale that requires teachers to give separate ratings for the different components of language ability (e.g. content, grammar, vocabulary). This type of evaluation requires teachers to consider multiple dimensions of performance rather than give an overall impression.
Anchor items: a set of items that remains the same in two or more forms of a test for the purposes of equating; a characteristic found in computer adaptive tests and IRT.
Aptitude test: a test of general ability that is usually not closely related to a specific curriculum and that is used primarily to predict future performance.
Assessment: the process of gathering, describing or quantifying information about performance.
Authenticity: refers to evaluation based mainly on real-life experiences; students show what they have learned by performing tasks similar to those required in real-life contexts; one of the cornerstones of good testing practice.
Banding scale: a type of holistic scale that measures language competence via descriptors of language ability; an example is the IELTS bands.


Benchmark: a detailed description of a specific level of student performance expected of students at particular ages, grades or developmental levels.
Bias: in general usage, this term refers to unfairness.
Branching test: an assessment in which test takers may be given different sets of items, depending on their responses to earlier items; this is a characteristic of computer adaptive testing.
Ceiling effect: the phenomenon where most test takers score near the top of the scale on a particular test; the test does not discriminate adequately at the higher ability levels.
Composite score: a score that is the combination of two or more scores by some specified formula.
Computer-based testing (CBT): tests that are administered to students on computer; question formats are frequently objective, discrete-point items; these tests are subsequently scored electronically.
Computer-adaptive testing (CAT): presents language items to the learner via computer; subsequent questions on the exam are "adapted" based on the student's response(s) to previous question(s).
Concurrent validity: the relationship between a test and another existing measure.
Construct: the complete set of knowledge, skills, abilities, or traits an assessment is intended to measure.
Content validity: this type of validity refers to testing what you teach, how you teach it; that is, testing content covered in some way in the course materials using formats that are familiar to the student.
Cornerstones of good testing practice: concepts that underpin good testing practice; they include usefulness, validity, reliability, practicality, transparency, authenticity, security and washback.
Construct validity: refers to the fit between the theoretical and methodological approaches used in a program and the assessment instruments administered.
Constructed-response item: a type of test item requiring students to produce their own responses, rather than select from a range of responses provided.
Criterion-referenced test: compares students' performance to particular outcomes or expectations.


Curve grades: refers to a practice whereby teachers add or subtract points on a test in order to make the results seem more acceptable; sometimes referred to as adjusting scores.
Cut score: a point on a scale above which test takers are classified in one way and below which they are classified in a different way.
Descriptive statistics: statistics that describe the population or provide summary data about the population taking the test; the most common descriptive statistics include the mean, mode, median, standard deviation and range; they are also known as the measures of central tendency.
Diagnostic test: a type of formative evaluation that attempts to diagnose students' strengths and weaknesses; typically students receive no grades on diagnostic instruments.
Difficulty: the extent to which an item is within the ability range of the student.
Directed-response item: a test item that is designed to elicit an answer from a closed or constrained set of options.
Direct test: a test which measures ability through a performance that approximates an authentic language scenario.
Discrete-point test: an objective test that measures the students' ability to answer questions on a particular aspect of language; discrete-point items are very popular with teachers because they are easy to score.
Discrimination: the power of an item to differentiate among test takers at different levels of ability.
Distractor: a response option in a forced-choice item which is not the key.
Distribution: the spread and pattern of a set of test scores or data.
Equating: a statistical procedure used to adjust scores on two or more alternate forms of an assessment so that the scores may be used interchangeably.
Equity: the concern for fairness, i.e. that assessments are free from bias or favoritism. At minimum, all assessments should be reviewed for a) stereotypes, b) situations that favor one culture over another, c) excessive language demands that prevent some students from showing their knowledge and d) the assessment's potential to include students with disabilities or limited English proficiency.
Equivalent forms: see alternate forms.


Error: nonsystematic fluctuations in scores caused by such factors as guessing or unreliable scoring; error is the difference between an observed score and the true score for an individual.
Evaluation: in most educational settings, evaluation means to measure, compare, and judge the quality of student work, schools or a specific educational program.
Face validity: refers to the overall appearance of the test; it is the extent to which a test appeals to test takers.
Fairness: the extent to which a test is appropriate for members of different groups regardless of gender, ethnicity, etc.
Forced-choice item: an item which requires the test taker to choose from response options that are provided.
Formative evaluation: refers to tests that are designed to measure students' achievement of instructional objectives; these tests give feedback on the extent to which students have mastered the course materials. Examples of this type of evaluation include achievement tests and mastery tests.
Grade inflation: refers to the practice of giving students higher grades than they deserve or grades that are not commensurate with their language ability levels.
Halo effect: the tendency of a rater to let overall impressions/judgments of a person influence judgments on more specific criteria.
High-stakes test: a test whose outcome can significantly affect the test taker's future; that is, a test where the test taker's future hinges on passing or failing.
Histogram: a graphic method of presenting statistical information.
Holistic scoring: based on an impressionistic method of scoring; examples are the scoring used with the TOEFL Test of Written English (TWE) and the IELTS banding system.
Impact: the effect that a test has on an individual student, an educational system and on society.
Indirect test: a test that does not require the student to perform tasks that directly relate to the kind of language use targeted in the classroom.
Integrative testing: goes beyond discrete-point test items and contextualizes language ability; test takers are required to combine various skills to answer the test questions. Partial dictation is an example.


Inter-rater reliability: the extent to which there is consistency of marks between multiple graders or raters; it is established through rater training and calibration.
Item: an individual question or exercise in an assessment or evaluative instrument.
Item bank: a large bank or number of items measuring the same skill or competency; item banks are most frequently found in objective testing, particularly in CBT and CAT.
Item analysis: a procedure whereby test items and distractors are examined based on the level of difficulty of the item and the extent to which they discriminate between high-achieving and low-achieving students. Results of item analyses are used in the upkeep and revision of item banks.
Item response theory (IRT): a mathematical model relating performance on questions to characteristics of the test takers and characteristics of the item.
Item violation: refers to a common mistake that teachers/testers make when writing test items.
Intra-rater reliability: the extent to which a rater is consistent in using a proficiency rating scale; refers to inner consistency.
Key: the correct answer to a question.
Live pilot: a practice used by institutions that do not have the time or resources to pilot test items; it refers to administering a test that has not previously been piloted or pre-tested but one that is solidly based on empirically validated specifications and written by trained testers.
Live test: a test that is currently in use or one that is being stored for a future administration.
Mean: known as the arithmetic average; to obtain the mean, the scores are added together and then divided by the number of students who took the test; the mean is a descriptive statistic.
Median: one of the measures of central tendency; it represents the 50th percentile or the middle score.
Mode: the most frequently received score in a distribution.
Moderation: refers to a process of review or evaluation of test materials and rating performance.


Monkey score: the score gained by random guessing, or literally the score that a monkey would receive on an item if it randomly pointed to an answer; for an MCQ with four response options the monkey score is 25%.
Multiple-choice test: an item where the student is required to select the correct/best answer from a selection of response options. MCQs include a stem (the question to be answered or the sentence to be completed) and a number of response options. One response option is the key while the others are distractors.
Norm-referenced test: measures language ability against a standard or "norm" performance of a group; standardized tests like the TOEFL are norm-referenced tests because they are normed through prior administrations to large numbers of students.
Objective test: can be scored based solely on an answer key; it requires no expert judgment or subjectivity on the part of the scorer.
Observed score: the score a person happens to obtain on a particular form of an assessment at a particular administration.
Ordering: refers to the sequencing of test items on a given test; considered to be an important factor in test development which can affect scores. There are generally two ways to order or sequence items: 1) a few easy items are placed at the beginning of the test while the rest are sequenced at random throughout; 2) the items are sequenced from easy to difficult.
Outlier: an extreme or rogue score which does not seem to belong to the general answer pattern of the population; outliers may skew the distribution as the mean is very sensitive to them.
Parallel tests: multiple versions of a test; they are written with test security in mind; they share the same framework, but the exact items differ.
Patching: a practice in high-stakes institutional testing whereby separate subscores are accepted from different test administrations; a student might take an exam and fail one of the three sections, in which case the two sections that he/she passed would not need to be repeated.
Performance-based test: requires students to show what they can do with the language as opposed to what they know about the language; such tests are often referred to as task-based.
Performance standards: explicit definitions of what students must do to demonstrate proficiency at a specific level.
Piloting: a common practice among language testers whereby an item or a format is administered to a small random or representative selection of the population to be tested; information from piloting is commonly used to revise items and improve them; also known as field testing or trialing.


Placement test: administered to incoming students in order to place them at the correct ability level; content on placement tests is specific to a given curriculum, so placement tests are most successfully produced in-house.
Portfolio assessment: one type of alternative assessment; portfolios are a representative collection of a student's work over an extended period of time; the aim is to document the student's progress in language learning via the completion of such tasks as reports, projects, artwork, and essays.
Practicality: one of the cornerstones of good testing practice; practicality refers to the practical issues that teachers and administrators must keep in mind when developing and administering tests; examples include time and available resources.
Practice effect: the phenomenon of taking two tests with the same or similar content and receiving a higher score on the second test with no actual increase in language ability.
Predictive validity: measures how well a test predicts performance on an external criterion.
Pretest: administering a test or set of test items before it goes live for the purpose of collecting information about the students or identifying problems with the items.
Proficiency test: not specific to a particular curriculum; it assesses a student's general ability level in the language as compared to all other students who study that language. An example is the TOEFL.
Profile marking: sometimes called analytical marking; after marking, teachers have a profile of the student's marks across the different criteria.
Range: one of the descriptive statistics; the range or min/max spans the lowest and highest scores in a distribution.
Rater: a person who evaluates or judges student performance on an assessment against specific performance criteria.
Rater training: the process of educating raters to evaluate student work and produce dependable scores.
Rating scale: an instrument used for the evaluation of writing and speaking; rating scales are either analytical or holistic, or a combination of the two.
Raw score: the number of items answered correctly.


Readability: the level of reading difficulty for a given text; most readability indices are based on vocabulary (frequency or length) and syntax (average sentence length); well-known readability formulas include Flesch-Kincaid and Fry.
Reliability: one of the cornerstones of good testing practice; reliability refers to the consistency of exam results over repeated administrations and the degree to which the results of an assessment are dependable and consistently measure particular student knowledge and/or skills.
Reported score: the actual score that is reported to the student.
Retired test: a test that is no longer live and may now be in the public domain; the term implies that the test was once secure and statistically valid; retired tests are often used as practice materials.
Security: measures taken to ensure that the test remains live and operational and does not fall into the hands of test takers.
Self-assessment: asks students to judge their own ability level in a language; one type of alternative assessment.
Severity: a characteristic of a rater; some raters may be consistently generous or lenient with scores, while others may be consistently harsh.
Specifications: a document that states what the test should be used for and whom it is aimed at; test specifications usually contain all instructions, examples of test formats/items, weighting information and pass/fail criteria.
Speededness: the extent to which test takers lack sufficient time to respond to items; for most tests, speededness is not a desirable characteristic.
Stakeholders: all those who have a stake or an interest in the use or effect of a particular test or assessment.
Standardized test: measures language ability against a norm or standard.
Standard Error of Measurement (SEM): a way of expressing test reliability.
Stem: the first part of a multiple-choice question; usually takes the form of a question or a sentence completion.
Stimulus: material provided as part of the test or task that the test taker has to respond to.
Subjective test: a test whose scoring requires knowledge of the content area being tested; it frequently depends on impression, human judgment and opinion at the time of scoring.


Summative evaluation: refers to a test that is given at the end of a course or course segment; the aim of summative evaluation is to give the student a grade that represents his/her mastery of the course content.
Test anxiety: a feeling of nervousness or fear surrounding an assessment; can occur before, during or after a test; has the potential to affect test performance.
Test equivalence: tests that are constructed from the same set of test specifications with the goal of testing the same skills; the scores on these tests are expected to be the same or similar.
Test-retest: parallel tests are administered before learning has occurred and after it has taken place for the purpose of determining or measuring how much language has been learned over time.
Test wiseness: refers to the amount and type of preparation or prior experience with the test that the test taker has.
Transparency: the idea that teachers and students have a right to know how they will be assessed and which criteria will be used to evaluate them.
True score: the score a person would receive if the test were perfectly reliable or the SEM were zero.
Validity: one of the cornerstones of good testing practice; refers to the degree to which a test measures what it is supposed to measure.
Washback: one of the cornerstones of good testing practice; refers to the impact a test or testing program may have on the curriculum.
Weighting: refers to the relative value that is placed on certain skills or sections within the exam.


Annotated Bibliography
OUR FAVORITE BOOKS ON LANGUAGE ASSESSMENT

Alderson, J. Charles, Caroline Clapham and Dianne Wall. 1995. Test Construction and Evaluation. Cambridge, U.K.: Cambridge University Press.
This volume describes and illustrates principles of test design, construction and evaluation. Each chapter deals with one stage of the test construction process. The final chapter examines current practice in EFL assessment.

Bachman, Lyle F. 1990. Fundamental Considerations in Language Testing. Oxford, U.K.: Oxford University Press.
This book explores the basic considerations that underlie the practical development and use of language tests: the nature of measurement, the contexts that determine the use of language tests, and the nature of both the language abilities to be measured and the testing methods that are used to measure them. Bachman also provides a synthesis of testing research.

Bachman, Lyle F. and Adrian S. Palmer. 1996. Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford, U.K.: Oxford University Press.
This book relates language testing practice to current views of communicative language teaching and testing. It builds on the theoretical background set forth in Bachman's 1990 volume. The authors discuss the design, planning and organization of tests.

Bailey, Kathleen. 1998. Learning About Language Assessment: Dilemmas, Decisions, and Directions. TeacherSource Series, (ed.) Donald Freeman. Boston: Heinle.
This text provides a practical analysis of language assessment theory and accessible explanations of the statistics involved.

Brown, H. Douglas. 2003. Language Assessment: Principles and Classroom Practice. Hertfordshire, UK: Prentice Hall.
An accessible book on assessment by an experienced teacher and teacher trainer.


Cambridge Language Assessment Series. (series editors) J. Charles Alderson & Lyle Bachman. Cambridge: Cambridge University Press.
This excellent series of professional volumes includes:
Assessing Languages for Specific Purposes, Dan Douglas
Assessing Vocabulary, John Read
Assessing Reading, Charles Alderson
Assessing Listening, Gary Buck
Assessing Writing, Sara Cushing Weigle
Assessing Speaking, Sari Luoma
Assessing Grammar, Jim Purpura

Cohen, Andrew D. 1994. Assessing Language Ability in the Classroom. Boston, MA: Heinle.
This second edition presents various principles for guiding teachers through the assessment process (dictation, cloze summary, oral interview, roleplays, portfolio assessment). Cohen deals with issues in assessment, not just with testing. He also examines the test-taking process and presents up-to-date topics in language assessment.

Coombe, Christine and Nancy Hubley. 2003. Assessment Practices. TESOL Case Studies Series. TESOL Publications.
This edited volume includes case studies about successful language assessment practices from a global perspective.

Coombe, Christine, Keith Folse and Nancy Hubley. 2007. A Practical Guide to Assessing English Language Learners. Ann Arbor, MI: University of Michigan Press.
This co-authored volume includes chapters on the basics of language assessment. The content revolves around two fictitious language teachers: one who has very good instincts about how students should be assessed, and one who is just starting out and makes mistakes along the way. The manual you are reading today is the basis for this book.

Davidson, Fred and Brian Lynch. 2002. Testcraft: A Teacher's Guide to Writing and Using Language Test Specifications. New Haven: Yale University Press.
This book is about language test development using test specifications. It is intended for language teachers at all career levels.

Fulcher, Glenn. 2003. Testing Second Language Speaking. London: Longman/Pearson Education.


This book offers a comprehensive treatment of testing speaking in a second language. It will be useful for anyone who has to develop speaking tests in their own institutions.

Genesee, Fred and John A. Upshur. 1996. Classroom-Based Evaluations in Second Language Education. Cambridge, U.K.: Cambridge University Press.
The authors emphasize the value of classroom-based assessment as a tool for improving both teaching and learning. The book is non-technical and presupposes no specialized knowledge in testing or statistics. The suggested assessment procedures are useful for a broad range of proficiency levels, teaching situations, and instructional approaches.

Harris, Michael and Paul McCann. 1994. Assessment. Oxford, U.K.: Heinemann Publishers.
This volume examines the areas of formal and informal assessment as well as self-assessment. Within each section, practical guidance is given on the issues of purpose, timing, methods and content. The ready-to-use materials include model tests, self-assessment and assessment instruments which teachers can adapt to suit their instructional context.

Heaton, J. B. 1988. Writing English Language Tests. Harlow, England: Longman Press.
This volume gives detailed and practical suggestions on methods of classroom testing and shows how both students and teachers can gain the maximum benefit from testing. Examples of useful testing techniques are included as well as practical advice on using them in the classroom.

Hughes, Arthur. 2003. Testing for Language Teachers (2nd Edition). Cambridge, U.K.: Cambridge University Press.
This practical guide is designed for teachers who want to have a better understanding of the role of testing in language teaching. The principles and practice of testing are presented in a logical, accessible way and guidance is given for teachers who devise their own tests.

McNamara, Tim. 2000. Language Testing. Oxford Introductions to Language Study, (ed.) H.G. Widdowson. Oxford: Oxford University Press.
This book examines issues such as test design, the rating process, validity and measurement. It looks at both traditional and newer forms of language assessment, the wider social and political context of testing and the challenges posed by new ideas.

Studies in Language Testing Series. (series editors) Michael Milanovic and Cyril Weir. Cambridge: Cambridge University Press.


This series focuses on important developments in language testing. The series has been produced by UCLES in conjunction with Cambridge University Press. Titles in the series are of considerable interest to test users, language test developers and researchers. Some of the excellent volumes in this series include:
Using Verbal Protocols in Language Test Validation: A Handbook, Alison Green
Dictionary of Language Testing, Alan Davies et al.
Fairness and Validation in Language Assessment, Antony John Kunnan
Experimenting with Uncertainty: Language Testing Essays in Honour of Alan Davies, Catherine Elder et al.
The Equivalence of Direct and Semi-Direct Speaking Tests, Kieran O'Loughlin
The Development of IELTS: A Study of the Effect of Background on Reading Comprehension, Caroline Clapham
The Multilingual Glossary of Language Testing Terms
Learner Strategy Use and Performance on Language Tests: A Structural Equation Modelling Approach, Jim Purpura
Issues in Computer-Adaptive Testing of Reading Proficiency, Micheline Chalhoub-Deville
A Qualitative Approach to the Validation of Oral Language Tests, Anne Lazaraton

Weir, Cyril. 1993. Understanding and Developing Language Tests. Hertfordshire, UK: Prentice Hall.
This book is designed for language teachers, teacher educators, and language teacher trainees interested in the theory and practice of language tests. The book takes a critical look at a range of published exams and helps readers understand not only how tests are (and should be) constructed, but how they relate to classroom teaching.


Contact Information
The presenters can be contacted in the following ways:

By mail
Christine Coombe
Dubai Men's College, HCT
P.O. Box 15825
Dubai, United Arab Emirates

Nancy Hubley
Alice Lloyd College
100 Purpose Road, #83
Pippa Passes, KY 41844

By email
Christine Coombe: christine.coombe@hct.ac.ae
Nancy Hubley: njhubleyae@yahoo.com

Acknowledgments: The authors are grateful to the many people who have provided support for this project. Thanks are particularly due to our colleagues at UAE University, Zayed University and the Higher Colleges of Technology who participated in numerous workshops and piloted these materials. We are grateful to our students who have taught us so much about testing. Lastly, we appreciate the feedback from ELT colleagues in many countries who have shared their assessment experiences with us.

