Language Assessment Module

TOPIC 1
OVERVIEW OF ASSESSMENT: CONTEXT, ISSUES AND TRENDS
1.0 SYNOPSIS
Topic 1 introduces the meanings of test, measurement, evaluation and
assessment, outlines the historical development of language assessment,
and describes the changing trends of language assessment in the Malaysian
context.
1.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1. define and explain the important terms of test, measurement,
evaluation, and assessment;
2. examine the historical development in Language Assessment;
3. describe the changing trends in Language Assessment in the
Malaysian context and discuss the contributing factors.
1.2 FRAMEWORK OF TOPICS
CONTENT
SESSION ONE (3 hours)
1.3 INTRODUCTION
Assessment and examinations are viewed as highly important in most Asian
countries such as Malaysia. Language tests and assessment have also
become a prevalent part of our education system. Often, public examination
results are taken as important national measures of school accountability.
While schools are ranked and classified according to their students'
performance in major public examinations, scores from language tests are
used to infer individuals' language ability and to inform decisions we make
about those individuals.
In this topic, let us discuss the concept of measurement and its
numerous definitions. We will also look into the historical development of
language assessment and the changing trends of language assessment in our
country.
1.4 DEFINITION OF TERMS: test, measurement, evaluation, and assessment
1.4.1 Test
The four terms above are frequently used interchangeably in academic
discussions. A test is a subset of assessment intended to measure
a test-taker's language proficiency, knowledge, performance or skills. Testing
is one type of assessment technique. It is a systematically prepared procedure
that happens at a point in time when a test-taker gathers all his abilities to
achieve ultimate performance because he knows that his responses are being
evaluated and measured. A test is first a method of measuring a test-taker's
ability, knowledge or performance in a given area; and second, it must
measure.
Bachman (1990), who is also quoted by Brown, defined a test as a
process of quantifying a test-taker's performance according to explicit
procedures or rules.
1.4.2 Assessment
Assessment is often a misunderstood term. Assessment is a
comprehensive process of planning, collecting, analysing, reporting, and using
information on students over time (Gottlieb, 2006, p. 86). Mousavi (2009) is of
the opinion that assessment is appraising or estimating the level or magnitude
of some attribute of a person. Assessment is an important aspect of the fields
of language testing and educational measurement and, perhaps, the most
challenging part of them. It is an ongoing process in educational practice, which
involves a multitude of methodological techniques. It can consist of tests,
projects, portfolios, anecdotal information and student self-reflection.
Assessment may be carried out formally or informally, subconsciously or
consciously, and incidentally or intentionally by an appraiser.
1.4.3 Evaluation
Evaluation is another confusing term. Many are confused between
evaluation and testing. Evaluation does not necessarily entail testing. In reality,
evaluation is involved when the results of a test (or other assessment
procedure) are used for decision-making (Bachman, 1990, pp. 22-23).
Evaluation involves the interpretation of information. If a teacher simply
records numbers or makes check marks on a chart, it does not constitute
evaluation. When a tester or marker evaluates, s/he values the results in such
a way that the worth of the performance is conveyed to the test-taker. This is
usually done with some reference to the consequences, either good or bad, of
the performance. This is commonly practised in applied linguistics research,
where the focus is often on describing processes, individuals, and groups, and
the relationships among language use, the language use situation, and
language ability.
Test scores are an example of measurement, and conveying the
meaning of those scores is evaluation. However, evaluation can occur
without measurement. For example, if a teacher appraises a student's correct
oral response with words like "Excellent insight, Lilly!", it is evaluation.
1.4.4 Measurement
Measurement is the assigning of numbers to certain attributes of
objects, events, or people according to a rule-governed system. For our
purposes of language testing, we will limit the discussion to unobservable
abilities or attributes, sometimes referred to as traits, such as grammatical
knowledge, strategic competence or language aptitude. Similar to other types
of assessment, measurement must be conducted according to explicit rules
and procedures as spelled out in test specifications, criteria, and procedures
for scoring. Measurement could be interpreted as the process of quantifying the
observed performance of classroom learners. Bachman (1990) cautioned us
to distinguish between quantitative and qualitative descriptions. Simply put,
the former involves assigning numbers (including rankings and letter grades)
to observed performance, while the latter consists of written descriptions, oral
feedback, and non-quantifiable reports.
The relationships among test, measurement, assessment, and their
uses are illustrated in Figure 1.
Figure 1: The relationship between tests, measurement and assessment.
(Source: Bachman, 1990)
2.0 Historical development in language assessment

From the mid-1960s through the 1970s, language testing practice, as
reflected in large-scale institutional language testing and in most language
testing textbooks of the time, was informed essentially by a theoretical view of
language ability as consisting of skills (listening, speaking, reading and
writing) and components (e.g. grammar, vocabulary, pronunciation) and an
approach to test design that focused on testing isolated discrete points of
language, while the primary concern was with psychometric reliability (e.g.
Lado, 1961; Carroll, 1968). Language testing research was dominated largely
by the hypothesis that language proficiency consisted of a single unitary trait,
and by a quantitative, statistical research methodology (Oller, 1979).
The 1980s saw other areas of expansion in language testing,
most importantly, perhaps, in the influence of second language
acquisition (SLA) research, which spurred language testers to investigate
not only the effects of a wide variety of factors such as field
independence/dependence (e.g. Stansfield and Hansen, 1983; Hansen, 1984;
Chapelle, 1988), academic discipline and background knowledge (e.g. Erickson
and Molly, 1983; Alderson and Urquhart, 1985; Hale, 1988) and discourse
domains (Douglas and Selinker, 1985) on language test performance, but also
the strategies involved in the process of test-taking itself (e.g. Grotjahn,
1986; Cohen, 1987).
If the 1980s saw a broadening of the issues and concerns of language
testing into other areas of applied linguistics, the 1990s saw a continuation of
this trend. In this decade the field also witnessed expansions in a number of
areas:
a) research methodology;
b) practical advances;
c) factors that affect performance on language tests;
d) authentic, or performance, assessments; and
e) concerns with the ethics of language testing and professionalising
the field.
The beginning of the new millennium is another exciting time for anyone
interested in language testing and assessment research. Current
developments in the fields of applied linguistics, language learning and
pedagogy, technological innovation, and educational measurement have
opened up some rich new research avenues.
3.0 Changing trends in Language Assessment: Malaysian context

History has clearly shown that teaching and assessment should be
intertwined in education. Assessment and examinations are viewed as highly
important in Malaysia. One does not need to look very far to see how
important testing and assessment have become in our education system.
Often, public examination results are taken as important national measures of
school accountability. Schools are ranked and classified according to their
students' performance in major public examinations. Just as assessment
impacts student learning and motivation, it also influences the nature of
instruction in the classroom. There has been considerable recent literature that
has promoted assessment as something that is integrated with instruction, and
not an activity that merely audits learning (Shepard, 2000). When assessment
is integrated with instruction, it informs teachers about what activities and
assignments will be most useful, what level of teaching is most appropriate,
and how summative assessments provide diagnostic information.
With this in mind, we have to look at the changing trends in assessment,
particularly language assessment, in this country, which has been carried out
mainly through the examination system until recent years. Starting from the
year 1845, written tests in schools were introduced for a number of subjects.
This trend in assessment continued with the intent to gauge (determine) the
effectiveness of the teaching-learning process. In Malaysia, the development
of formal evaluation and testing in education began after Independence.
Public examinations have long been the only measurement of students'
achievement. Figure 1 shows the stages/phases of development of the
examination system in our country. The stages are as follows:
Pre-Independence
Razak Report
Rahman Talib Report
Cabinet Report
Malaysia Education Blueprint (2013-2025)
On 3rd May 1956, the Examination Unit (later known as the Examination
Syndicate) in the Ministry of Education (MOE) was formed on the
recommendation of the Razak Report (1956). The main objective of the
Malaysia Examination Syndicate (MES) was to fulfil one of the Razak Report's
recommendations, which was to establish a common examination system for
all the schools in the country.
In line with the on-going transformation of the national educational
system, the current scenario is gradually changing. A new evaluation system
known as School-Based Assessment (SBA) was introduced in 2002 as a
move away from traditional teaching, to keep abreast of changing trends in
assessment and to gauge the competence of students by taking into
consideration both academic and extracurricular achievements.
According to the Malaysian Ministry of Education (MOE), the new
assessment system aims to promote a combination of centralised and
school-based assessment. The Malaysian Teacher Education Division (TED) is
entrusted by the Ministry of Education to formulate policies and guidelines to
prepare teachers for the new implementation of assessment. As emphasised in the
innovation of the student assessment, continuous school-based assessment is
administered at all grades and all levels. Additionally, students sit for common
public examinations at the end of each level. The role of teachers in the new
assessment system is vital: teachers will be empowered to assess their
students.
The Malaysia Education Blueprint was launched in September this year,
and with it, a three-wave initiative to revamp the education system over the
next 12 years. One of its main focuses is to overhaul the national curriculum
and examination system, widely seen as heavily content-based and not
holistic. It is a timely move, given our poor results in the 2009 Programme for
International Student Assessment (PISA) tests. Based on the 2009
assessment, Malaysia lags far behind regional peers like Singapore, Japan,
South Korea, and Hong Kong in every category.
Poor performance in PISA is normally linked to students not being able
to demonstrate higher-order thinking skills. To remedy this, the Ministry of
Education has started to implement numerous changes to the examination
system. Two out of the three nationwide examinations that we currently
administer to primary and secondary students have gradually seen major
changes. Generally, the policies are ideal and impressive, but there are still a
few questions on feasibility that have been raised by concerned parties.
Figure 2 below shows the development of educational evaluation in Malaysia
since pre-independence until today.
Pre-Independence:
Examinations were conducted according to the needs of the school or based on
overseas examinations such as the Overseas School Certificate.
Implementation of the Razak Report (1956):
The Razak Report gave birth to the National Education Policy and the creation
of the Examination Syndicate (LP). LP conducted examinations such as the
Cambridge and Malayan Secondary School Entrance Examination (MSSEE), and
the Lower Certificate of Education (LCE) Examination.
Implementation of the Rahman Talib Report (1960):
The Rahman Talib Report recommended the following actions:
1. Extend schooling age to 15 years old.
2. Automatic promotion to higher classes.
3. Multi-stream education (Aneka Jurusan).
The following changes in examination were made:
- The entry of elective subjects in LCE and SRP.
- The introduction of the Standard 5 Evaluation Examination.
- The introduction of Malaysia's Vocational Education Examination.
- The introduction of the Standard 3 Diagnostic Test (UDT).
Implementation of the Cabinet Report (1979):
The implementation of the Cabinet Report resulted in the evolution of the
education system to its present state, especially with KBSR and KBSM.
Adjustments were made in examination to fulfil the new curriculum's needs and
to ensure it is in line with the National Education Philosophy.
Implementation of the Malaysia Education Blueprint (2013-2025):
The emphasis is on School-Based Assessment (SBA). It was first introduced in
2002. It is a new system of assessment and is one of the new areas where
teachers are directly involved. The national examination and school-based
assessments are revamped in stages, whereby by 2016, at least 40% of questions
in Ujian Penilaian Sekolah Rendah (UPSR) and 50% in Sijil Pelajaran Malaysia
(SPM) are higher-order thinking skills questions.
Figure 2: The development of educational evaluation in Malaysia
Source: Malaysia Examination Board (MES)
http://apps.emoe.gov.my/1pm/maklumatam.htm
By and large, the role of MES is to complement and complete the
implementation of the national education policy. Among its achievements are:
Figure 3: The achievements of Malaysia Examination Syndicate (MES)
Source: Malaysia Examination Board (MES)
http://apps.emoe.gov.my/1pm/maklumatam.htm
Tutorial question
Examine the contributing factors to the changing trends of
language assessment.
Create and present findings using graphic organisers.
TOPIC 2
ROLE AND PURPOSES OF ASSESSMENT IN TEACHING AND LEARNING
person is in a particular language skill area. Their purpose is to describe what
students are capable of doing in a language.
Proficiency tests are usually developed by external bodies such as
examination boards like Educational Testing Service (ETS) or Cambridge
ESOL. Some proficiency tests have been standardised for international use,
such as the American TOEFL test, which is used to measure the English
language proficiency of foreign college students who wish to study in North
American universities, or the British-Australian IELTS test, designed for those
who wish to study in the United Kingdom or Australia (Davies et al., 1999).
Achievement Tests
Achievement tests are similar to progress tests in that their purpose is
to see what a student has learned with regard to stated course outcomes.
However, they are usually administered at the mid- and end-point of the semester
or academic year. The content of achievement tests is generally based on the
specific course content or on the course objectives. Achievement tests are
often cumulative, covering material drawn from an entire course or semester.
Diagnostic Tests
Diagnostic tests seek to identify those language areas in which a
student needs further help. Harris and McCann (1994, p. 29) point out that
where other types of tests are based on success, diagnostic tests are based
on failure. The information gained from diagnostic tests is crucial for planning
further course activities and providing students with remediation. Because
diagnostic tests are difficult to write, placement tests often serve a dual
function of both placement and diagnosis (Harris & McCann, 1994; Davies et
al., 1999).
Aptitude Tests
This type of test no longer enjoys the widespread use it once had. An
aptitude test is designed to measure general ability or capacity to learn a
foreign language a priori (before taking a course) and ultimate predicted
success in that undertaking. Language aptitude tests were seemingly
designed to apply to the classroom learning of any language. In the United
States, two common standardised English Language tests once used were the
Modern Language Aptitude Test (MLAT; Carroll & Sapon, 1958) and the
Pimsleur Language Aptitude Battery (PLAB; Pimsleur, 1966). Since there is
no research to show unequivocally that these kinds of tasks predict
communicative success in a language, apart from untutored language
acquisition, standardised aptitude tests are seldom used today with the
exception of identifying foreign language disability (Stansfield & Reed, 2004).
Progress Tests
These tests measure the progress that students are making towards
defined course or programme goals. They are administered at various stages
throughout a language course to see what the students have learned, perhaps
after certain segments of instruction have been completed. Progress tests are
generally teacher produced and are narrower in focus than achievement tests
because they cover a smaller amount of material and assess fewer objectives.
Placement Tests
These tests, on the other hand, are designed to assess students' level
of language ability for placement in an appropriate course or class. This type
of test indicates the level at which a student will learn most effectively. The
main aim is to create groups that are homogeneous in level. In designing a
placement test, the test developer may choose to base the test content either
on a theory of general language proficiency or on learning objectives of the
curriculum. In the former, institutions may choose to use a well-established
proficiency test such as the TOEFL or IELTS exam and link it to curricular
benchmarks. In the latter, tests are based on aspects of the syllabus taught at
the institution concerned.
In some contexts, students are placed according to their overall rank in
the test results. At other institutions, students are placed according to their
level in each individual skill area. Elsewhere, placement test scores are used
to determine if a student needs any further instruction in the language or could
matriculate directly into an academic programme.
Discuss the extent to which tests or assessment tasks serve their purpose.
The end of the topic. Happy reading!
TOPIC 3
BASIC TESTING TERMINOLOGY
CONTENT
SESSION THREE (3 hours)
3.3 Norm-Referenced Test (NRT)
According to Brown (2010), in NRTs an individual test-taker's score is
interpreted in relation to a mean (average score), median (middle score),
standard deviation (extent of variance in scores), and/or percentile rank. The
purpose of such tests is to place test-takers along a mathematical continuum
in rank order. In a test, scores are commonly reported back to the test-taker in
the form of a numerical score (for example, 250 out of 300) and a percentile
rank (for instance, 78 percent, which denotes that the test-taker's score was
higher than 78 percent of the total number of test-takers but lower than 22
percent in the administration). In other words, an NRT is administered to compare
an individual's performance with that of his or her peers and/or to compare a
group with other groups. In the School-Based Evaluation, NRT is used for
summative evaluation, such as in the end-of-year examination for the streaming
and selection of students.
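The statistics named above can be illustrated with a short Python sketch. The scores below are invented for illustration only, and the helper percentile_rank is a hypothetical name. It uses one common definition (the percentage of test-takers who scored strictly lower); other definitions exist and give slightly different values.

```python
from statistics import mean, median, pstdev


def percentile_rank(score, all_scores):
    """Percentage of scores in all_scores that are strictly below score.

    This is one common definition of percentile rank, assumed here
    for illustration; it is not taken from the module itself.
    """
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical scores (out of 300) for one test administration.
scores = [120, 150, 180, 200, 210, 230, 240, 250, 260, 280]

print("Mean:", mean(scores))
print("Median:", median(scores))
print("Standard deviation:", round(pstdev(scores), 1))
print("Percentile rank of 250:", percentile_rank(250, scores))
```

In this invented group, a score of 250 is higher than seven of the ten scores, so its percentile rank under the assumed definition is 70. The same raw score could receive a very different rank in a stronger or weaker group, which is the defining feature of norm-referenced interpretation.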
3.4 Criterion-Referenced Test (CRT)

Gottlieb (2006), on the other hand, refers to criterion-referenced tests as
the collection of information about student progress or achievement in relation
to a specified criterion. In a standards-based assessment model, the
standards serve as the criteria or yardstick for measurement. Following
Glaser (1973), the word criterion means the use of score values that can be
accepted as the index of attainment for a test-taker. Thus, CRTs are designed
to provide feedback to test-takers, mostly in the form of grades, on specific
course or lesson objectives. The Curriculum Development Centre (2001) defines
CRT as an approach that provides information on students' mastery based on
the criteria determined by the teacher. These criteria are based on learning
outcomes or objectives as specified in the syllabus. The main advantage of
CRTs is that they allow testers to make inferences about how much
language proficiency, in the case of language proficiency tests, or knowledge
and skills, in the case of academic achievement tests, test-takers/students
originally have and their successive gains over time. As
opposed to NRTs, CRTs focus on students' mastery of a subject matter
(represented in the standards) along a continuum instead of ranking students
on a bell curve. Table 3 below shows the differences between Norm-Referenced
Test (NRT) and Criterion-Referenced Test (CRT).
Definition
  Norm-Referenced Test: A test that measures students' achievement as compared to other students in the group
  Criterion-Referenced Test: An approach that provides information on students' mastery based on a criterion specified by the teacher
Purpose
  Norm-Referenced Test: Determine performance difference among individuals and groups
  Criterion-Referenced Test: Determine learning mastery based on specified criterion and standard
Test Item
  Norm-Referenced Test: From easy to difficult level and able to discriminate examinees' ability
  Criterion-Referenced Test: Guided by minimum achievement in the related objectives
Frequency
  Norm-Referenced Test: Continuous assessment
  Criterion-Referenced Test: Continuous assessment in the classroom
Appropriateness
  Norm-Referenced Test: Summative evaluation
  Criterion-Referenced Test: Formative evaluation
Example
  Norm-Referenced Test: Public exams: UPSR, PMR, SPM, and STPM
  Criterion-Referenced Test: Mastery test: monthly test, coursework, project, exercises in the classroom
Table 3: The differences between Norm-Referenced Test (NRT) and Criterion-Referenced Test (CRT)
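To contrast with norm-referenced ranking, a criterion-referenced interpretation can be sketched in Python as a check of each score against a fixed criterion, with no reference to other students. The objectives, marks and the 80 percent cut-off below are assumptions chosen for illustration, and mastery_report is a hypothetical helper name.

```python
def mastery_report(scores, max_scores, cut_off_percent=80):
    """Return, for each objective, whether the student reached the criterion.

    The decision depends only on the student's own score and the fixed
    cut-off (an assumed 80 percent), not on how other students performed.
    """
    return {
        objective: scores[objective] * 100 >= cut_off_percent * max_scores[objective]
        for objective in max_scores
    }


# Hypothetical learning objectives with the marks available for each.
max_scores = {"past tense": 10, "articles": 10, "prepositions": 5}
# Hypothetical marks obtained by one student.
student = {"past tense": 9, "articles": 6, "prepositions": 4}

report = mastery_report(student, max_scores)
for objective, mastered in report.items():
    print(objective, "- mastered" if mastered else "- not yet mastered")
```

Here the student would be reported as having mastered "past tense" and "prepositions" but not "articles", whatever the results of classmates. That information points directly to the objective needing remediation.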
3.5 Formative Test
Formative test or assessment, as the name implies, is a kind of
feedback teachers give students while the course is progressing. Formative
assessment can be seen as assessment for learning. It is part of the
instructional process. We can think of formative assessment as practice.
With continual feedback, the teachers may assist students to improve their
performance. The teachers point out what the students have done wrong
and help them to get it right. This can take place when teachers examine the
results of achievement and progress tests. Based on the results of formative
test or assessment, the teachers can suggest changes to the focus of the
curriculum or emphasis on some specific lesson elements. On the other hand,
students may also need to change and improve. Due to the demanding nature
of this formative test, numerous teachers prefer not to adopt this test, although
giving back any assessed homework or achievement test presents both
teachers and students with healthy and valuable learning opportunities.
3.6 Summative Test
Summative test or assessment, on the other hand, refers to the kind of
measurement that summarises what the student has learnt or gives a one-off
measurement. In other words, summative assessment is assessment of
student learning. Students are more likely to experience assessment carried
out individually, where they are expected to reproduce discrete language items
from memory. The results are then used to yield a school report and to
determine what students know and do not know. It does not necessarily provide
a clear picture of an individual's overall progress or even his/her full potential,
especially if s/he is hindered by the fear factor of physically sitting for a test, but
may provide straightforward and invaluable results for teachers to analyse. It is
given at a point in time to measure student achievement in relation to a clearly
defined set of standards, but it does not necessarily show the way to future
progress. It is given after learning is supposed to occur. End-of-year tests
in a course and other general proficiency or public exams are some of the
examples of summative tests or assessment. Table 3.1 shows formative and
summative assessments that are common in schools.
Formative Assessment: anecdotal records; quizzes and essays; diagnostic tests
Summative Assessment: final exams; national exams (UPSR, PMR, SPM, STPM); entrance exams
Table 3.1: Common formative and summative assessments in schools
3.7 Objective Test
According to BBC Teaching English, an objective test is a test that
consists of right or wrong answers or responses and thus it can be marked
objectively. Objective tests are popular because they are easy to prepare and
take, quick to mark, and provide a quantifiable and concrete result. They tend
to focus more on specific facts than on general ideas and concepts.
The types of objective tests include the following:
i. Multiple-choice items/questions;
ii. True-false items/questions;
iii. Matching items/questions; and
iv. Fill-in-the-blank items/questions.
In this topic, let us focus on multiple-choice questions, which may
look easy to construct but are, in reality, very difficult to build correctly. This is
congruent with the viewpoint of Hughes (2003, pp. 76-78), who warns against
many weaknesses of multiple-choice questions. The weaknesses include:
It may limit beneficial washback;
It may enable cheating among test-takers;
It is very challenging to write successful items;
This technique strictly limits what can be tested;
This technique tests only recognition knowledge;
It may encourage guessing, which may have a considerable effect on
test scores.
Let us look at some important terminology used when designing multiple-choice
questions. Five terms are associated with this objective test item, namely:
1. Receptive or selective response
Items in which the test-taker chooses from a set of responses rather than
creating a response (the latter is commonly called a supply type of response).
2. Stem
Every multiple-choice item consists of a stem (the body of the item that
presents a stimulus). The stem is the question or assignment in an item. It may
take the form of a complete or open, positive or negative sentence. The stem
must be short, simple, compact and clear. However, it must not easily give
away the right answer.
3. Options or alternatives
This is the list of possible responses to a test item.
There are usually between three and five options/alternatives to
choose from.
4. Key
This is the correct response. The response can either be the correct one
or the best one. Usually, for a good item, the correct answer is not obvious
when compared with the distractors.
5. Distractors
A distractor, also known as a disturber, is included to distract students from
selecting the correct answer. An excellent distractor is almost the same as the
correct answer, but it is not correct.
When building multiple-choice items for both classroom-based and
large-scale standardised tests, consider the four guidelines below:
i. Design each item to measure a single objective;
ii. State both stem and options as simply and directly as possible;
iii. Make certain that the intended answer is clearly the only correct
one;
iv. (Optional) Use item indices to accept, discard or revise items.
3.8 Subjective Test
Contrary to an objective test, a subjective test is evaluated by giving an
opinion, usually based on agreed criteria. Subjective tests include essay,
short-answer, vocabulary, and take-home tests. Some students become very
anxious about these tests because they feel their writing skills are not up to par.
In reality, a subjective test provides more opportunity for test-takers to
demonstrate their understanding and/or in-depth knowledge and skills in
the subject matter. In this case, test-takers might provide some acceptable,
alternative responses that the tester, teacher or test developer did not
predict. Generally, subjective tests will test the higher skills of analysis,
synthesis, and evaluation. In short, subjective tests enable students to be
more creative and critical. Table 3.2 shows various types of objective and
subjective assessments.
Objective Assessments: True/False Items; Multiple-choice Items; Multiple-response Items; Matching Items
Subjective Assessments: Extended-response Items; Restricted-response Items; Essay
Table 3.2: Various types of objective and subjective assessments
Some have argued that the distinction between objective and subjective
assessments is neither useful nor accurate because, in reality, there is no such
thing as objective assessment. In fact, all assessments are created with
inherent biases built into decisions about relevant subject matter and content,
as well as cultural (class, ethnic, and gender) biases.
Reflection
1. Objective test items are items that have only one answer or correct
response. Describe in depth the multiple-choice test item.
2. Subjective test items allow for subjectivity in the response given by
the test-takers. Explain in detail the various types of subjective test
items.
Discussion
1. Identify at least three differences between formative and summative
assessment.
2. What are the strengths of multiple-choice items compared to essay
items?
3. Informal assessments are often unreliable, yet they are still
important in classrooms. Explain why this is the case, and defend
your explanation with examples.
4. Compare and contrast Norm-Referenced Test with Criterion-Referenced
Test.
TOPIC 4
BASIC PRINCIPLES OF ASSESSMENT
4.0 SYNOPSIS
Topic 4 defines the basic principles of assessment (reliability, validity,
practicality, washback, and authenticity) and the essential sub-categories
within reliability and validity.
4.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1. define the basic principles of assessment (reliability, validity,
practicality, washback, and authenticity) and the essential
sub-categories within reliability and validity;
2. explain the differences between validity and reliability;
3. distinguish the different types of validity and reliability in tests
and other instruments in language assessment.
4.2 FRAMEWORK OF TOPICS
[Diagram labels: Reliability, Interpretability, Validity, Types of Tests, Practicality, Authenticity, Washback Effect, Objectivity]
CONTENT
SESSION FOUR (3 hours)
4.3 INTRODUCTION
Assessment is a complex, iterative process requiring skills,
understanding, and knowledge in the exercise of professional judgment. In
this process, there are five important criteria that testers ought to look into
when evaluating a test: reliability, validity, practicality, washback and
authenticity. Since these five principles are context dependent, there is no
priority order implied in the order of presentation.
4.4
RELIABILITY (consistency)
Reliability means the degree to which an assessment tool produces stable
and consistent results. It is a concept that is easily misunderstood
(Feldt & Brennan, 1989).
Reliability essentially denotes consistency, stability, dependability, and
accuracy of assessment results (McMillan, 2001a, p. 65, in Brown, G. et al.,
2008). Because there is tremendous variability from one teacher or tester
to another that affects student performance, reliability in planning,
implementing, and scoring student performances is what gives rise to valid
assessment.
Fundamentally, a reliable (trustworthy) test is consistent and
dependable. If a tester administers the same test to the same test-taker or
matched test-takers on two occasions, the test should give the same
results. In a validity chain, test administrators need to be sure that
scoring is carried out properly. If the scores used by the tester do not
accurately reflect what the test-taker actually did, would not be awarded
by another marker, or would not be obtained on a similar assessment, then
these scores lack reliability. Errors can enter scores in many ways: for
example, giving Level 2 when another rater would give Level 4, adding up
marks wrongly, transcribing scores from test paper to database
inaccurately, or students performing really well on the first half of the
assessment and poorly on the second half due to fatigue. Thus, lack of
reliability in the scores students receive is a threat to validity.
According to Brown (2010), a reliable test can be described as
follows:
Consistent in its conditions across two or more administrations
Gives clear directions for scoring / evaluation
Has uniform rubrics for scoring / evaluation
Lends itself to consistent application of those rubrics by the scorer
Contains items/tasks that are unambiguous to the test-taker
4.4.1 Rater Reliability
When humans are involved in the measurement procedure, there is a
tendency towards error, bias and subjectivity in determining the scores
of similar tests. There are two kinds of rater reliability, namely
inter-rater reliability and intra-rater reliability.
Inter-rater reliability refers to the degree of similarity between
different testers or raters: can two or more testers/raters, without
influencing one another, give the same marks to the same set of scripts?
(Contrast this with intra-rater reliability.)
One way to test inter-rater reliability is to have each rater assign
each test item a score. For example, each rater might score items on a
scale from 1 to 10. Next, you would calculate the correlation
(connection) between the two ratings to determine the level of
inter-rater reliability. Another means of testing inter-rater reliability
is to have raters determine which category each observation falls into and
then calculate the percentage of agreement between the raters. So, if
the raters agree 8 out of 10 times, the test has an 80% inter-rater
reliability rate. In short, rater reliability is assessed by having two or
more independent judges score the test; the scores are then compared to
determine the consistency of the raters' estimates.
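The two checks described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function names
(pearson_r, percent_agreement) and the sample scores are hypothetical and
are not taken from any of the cited sources, and the Pearson coefficient is
assumed here as the correlation measure.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


def percent_agreement(categories_a, categories_b):
    """Percentage of observations that two raters placed in the same category."""
    if len(categories_a) != len(categories_b) or not categories_a:
        raise ValueError("need two equal-length, non-empty lists")
    matches = sum(1 for a, b in zip(categories_a, categories_b) if a == b)
    return 100 * matches / len(categories_a)


# Hypothetical data: two raters score the same ten items from 1 to 10.
rater_a_scores = [7, 5, 9, 4, 6, 8, 3, 10, 6, 5]
rater_b_scores = [6, 5, 9, 5, 6, 7, 4, 9, 7, 5]

# Hypothetical data: the same raters sort ten scripts into bands.
rater_a_bands = ["A", "B", "B", "C", "A", "B", "C", "A", "B", "C"]
rater_b_bands = ["A", "B", "C", "C", "A", "B", "C", "A", "A", "C"]

print(f"correlation between raters: {pearson_r(rater_a_scores, rater_b_scores):.2f}")
print(f"agreement between raters: {percent_agreement(rater_a_bands, rater_b_bands):.0f}%")
```

In this made-up example the raters agree on 8 of the 10 scripts, which
matches the 80% case described in the text.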
Intra-rater reliability is an internal factor: its main concern is
consistency within the rater. For example, if a rater (teacher) has many
examination papers to mark and does not have enough time to mark them,
s/he might take much more care with the first, say, ten papers than with
the rest. This inconsistency will affect the students' scores; the first
ten might get higher scores. In other words, while inter-rater
reliability involves two or more raters, intra-rater reliability is the
consistency of grading by a single rater, where scores on a test are
rated by the same rater/judge at different times.
When we grade tests at different times, we may become inconsistent in
our grading for various reasons. Some papers that are graded during the
day may get our full and careful attention, while others that are graded
towards the end of the day are very quickly glossed over. As such,
intra-rater reliability indicates the consistency of our grading.
Both inter- and intra-rater reliability deserve close attention in that
test scores are likely to vary from rater to rater or even within the
same rater (Clark, 1979).
4.4.2 Test Administration Reliability
A number of factors influence test administration reliability.
Unreliability occurs due to outside interference such as noise,
variations in photocopying, temperature variations, the amount of light
in various parts of the room, and even the condition of desks and chairs.
Brown (2010) stated that he once witnessed the administration of a test
of aural comprehension in which an audio player was used to deliver items
for comprehension, but due to street noise outside the building,
test-takers sitting next to open windows could not hear the stimuli
clearly. According to him, that was a clear case of unreliability caused
by the conditions of the test administration.
4.4.3 Factors influencing Reliability
Figure 4.4.3 Factors that affect the reliability of a test
The outcome of a test is influenced by many factors. Assuming that the
factors are constant and not subject to change, a test is considered to
be reliable if the scores are consistent and not different from other
equivalent and reliable test scores. However, tests are not free from
errors. Factors that affect the reliability of a test include test
length factors, teacher and student factors, environment factors, test
administration factors, and marking factors.
a. Test length factors
In general, longer tests produce higher reliabilities. Because short
tests depend more heavily on chance and guessing, scores become more
accurate as the test gets longer. An objective test has higher
consistency because it is not exposed to a variety of interpretations. A
valid test is said to be reliable, but a reliable test need not be valid:
a consistent score does not necessarily measure what it is intended to
measure. In addition, test items are only samples of the subject being
tested; variation in the samples found in two equivalent tests can be one
of the causes of unreliable test outcomes.
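The link between test length and reliability is commonly illustrated with
the Spearman-Brown prophecy formula. This module does not itself present
that formula, so the sketch below is offered only as an aid to intuition;
it assumes that the added items are of comparable quality to the existing
ones, and the function name and figures are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`.

    reliability   -- reliability of the current test, between 0 and 1
    length_factor -- new length divided by old length (2 = twice as long)
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical case: a short quiz with reliability 0.60 is doubled and tripled.
for factor in (1, 2, 3):
    print(f"length x{factor}: predicted reliability {spearman_brown(0.60, factor):.2f}")
```

The predicted value rises with each increase in length but by smaller and
smaller steps, which is one reason very long tests bring diminishing
returns.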
b. Teacher-student factors
In most tests, it is normal for teachers to construct and administer
tests for students. Thus, a good teacher-student relationship would help
increase the consistency of the results. Other factors that contribute
positively to the reliability of a test include teachers' encouragement,
positive mental and physical condition, familiarity with the test
formats, and perseverance (determination) and motivation.
c. Environment factors
An examination environment certainly influences test-takers and their
scores. A favourable environment with comfortable chairs and desks, good
ventilation, sufficient light and space will improve the reliability of
the test. On the contrary, a non-conducive environment will affect
test-takers' performance and test reliability.
d. Test administration factors
Because students' grades are dependent on the way tests are
administered, test administrators should strive to provide clear and
accurate instructions, sufficient time and careful monitoring of tests to
improve the reliability of their tests. A test-retest technique can be
used to determine test reliability.
e. Marking factors
Unfortunately, we human judges have many opportunities to introduce
error in our scoring of essays (Linn & Gronlund, 2000; Weigle, 2002). It
is possible that our scoring invalidates many of the interpretations we
would like to make based on this type of assessment. Brennan (1996) has
reported that in large-scale, high-stakes marking panels that are tightly
trained and monitored, marker effects are small. Hence, it can be
concluded that in low-stakes, small-scale marking, there is potentially a
large error introduced by individual markers. It is also common for
different markers to award different marks for the same answer even with
a prepared mark scheme. A marker's assessment may vary from time to time
and with different situations. Conversely, this does not happen with
objective tests since the responses are fixed. Thus, objectivity is a
condition for reliability.
4.5
VALIDITY
Validity refers to the evidence base that can be provided about the
appropriateness of the inferences, uses, and consequences that come from
assessment (McMillan, 2001a). Appropriateness has to do with the soundness
(accuracy), trustworthiness, or legitimacy of the claims or inferences
(conclusions) that testers would like to make on the basis of obtained
scores. Clearly, we have to evaluate the whole assessment process and its
constituent (component) parts by how soundly (thoroughly) we can defend the
consequences that arise from the inferences and decisions we make. Validity,
in other words, is not a characteristic of a test or assessment but a
judgment, which can have varying degrees of strength.
So, the second characteristic of good tests is validity, which refers to
whether the test is actually measuring what it claims to measure. This is
important for us as we do not want to make claims concerning what a student
can or cannot do based on a test when the test is actually measuring
something else. Validity is usually determined logically although several types
of validity may use correlation coefficients.
According to Brown (2010), a valid test of reading ability actually
measures reading ability and not 20/20 vision, previous knowledge of a
subject, or some other variable of questionable relevance. To measure
writing ability, one might ask students to write as many words as they can
in 15 minutes, then simply count the words for the final score. Such a test
is practical (easy to administer) and the scoring quite dependable
(reliable). However, it would not constitute (represent) a valid test of
writing ability unless it also took into account the comprehensibility
(clarity) of the writing, its rhetorical discourse elements, and the
organisation of ideas.
The following are the different types of validity:
Face validity: Do the assessment items appear to be appropriate?
Content validity: Does the assessment content cover what you want to
assess? Have satisfactory samples of language and language skills been
selected for testing?
Construct validity: Are you measuring what you think you're measuring?
Is the test based on the best available theory of language and language
use?
Concurrent (parallel) validity: Can you use the current test score to
estimate scores on other criteria? Does the test correlate with other
existing measures?
Predictive validity: Is it accurate for you to use your existing students'
scores to predict future students' scores? Does the test successfully
predict future outcomes?
It is fairly obvious that a valid assessment should have good coverage of
the criteria (concepts, skills and knowledge) relevant to the purpose of the
examination. The important notion here is the purpose.
Figure 4.5: Types of Validity

4.5.1 Face validity
Face validity is validity which is determined impressionistically, for
example by asking students whether the examination was appropriate to
their expectations (Henning, 1987). Mousavi (2009) refers to face
validity as the degree to which a test looks right, and appears to
measure the knowledge or abilities it claims to measure, based on the
subjective judgement of the examinees who take it, the administrative
personnel who decide on its use, and other psychometrically
unsophisticated observers.
It is pertinent (important) that a test looks like a test even at first
impression. If students taking a test do not feel that the questions
given to them are a test or part of a test, then the test may not be
valid, as the students may not take it seriously enough to attempt the
questions. The test, hence, will not be able to measure what it claims to
measure.
4.5.2 Content validity
Content validity is concerned with whether or not the content of the test
is sufficiently representative and comprehensive for the test to be a
valid measure of what it is supposed to measure (Henning, 1987). The most
important step in ensuring content validity is to make sure all content
domains are represented in the test. Another method of verifying validity
is to use a Table of Test Specification that gives detailed information
on each content area, level of skills, level of difficulty, number of
items, and item representation for rating in each content area, skill or
topic.
We can quite easily imagine taking a test after going through an entire
language course. How would you feel if, at the end of the course, your
final examination consisted of only one question that covered one element
of language from the many that were introduced in the course? If the
language course was a conversational course focusing on the different
social situations that one may encounter, how valid is a final
examination that only requires you to demonstrate your ability to place
an order at a posh restaurant in a five-star hotel?
4.5.3 Construct validity
A construct is a psychological concept used in measurement. Construct
validity is the most obvious reflection of whether a test measures what
it is supposed to measure, as it directly addresses the issue of what it
is that is being measured. In other words, construct validity refers to
whether the underlying theoretical constructs that the test measures are
themselves valid. Proficiency, communicative competence, and fluency are
examples of linguistic constructs; self-esteem and motivation are
psychological constructs.
Fundamentally, every issue in language learning and teaching involves
theoretical constructs. Take, for instance, the assessment of a student's
oral proficiency. To possess construct validity, the test should cover
the various components of fluency: speed, rhythm, juncture, (lack of)
hesitations, and other elements within the construct of fluency. Tests
are, in a manner of speaking, operational definitions of constructs in
that their test tasks are the building blocks of the entity that is being
measured (see Davidson, Hudson, & Lynch, 1985; T. McNamara, 2000).
4.5.4 Concurrent validity
Concurrent validity is established by using another, more reputable and
recognised test to validate one's own test. For example, suppose you come
up with your own new test and would like to determine its validity. If
you choose to use concurrent validity, you would look for a reputable
test and compare your students' performance on your test with their
performance on the reputable and acknowledged test. In concurrent
validity, a correlation coefficient is obtained and used to generate an
actual numerical value. A high positive correlation of 0.7 to 1 indicates
that the learners' scores are relatively similar on the two tests or
measures.
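As a rough illustration of this procedure, the sketch below correlates
scores on a new classroom test with scores on a reference test and checks
the result against the 0.7 threshold mentioned above. The Pearson
coefficient is assumed as the correlation measure, and the function names
and scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


def concurrent_validity(new_test, reference_test, threshold=0.7):
    """Return the correlation and whether it reaches the chosen threshold."""
    r = pearson_r(new_test, reference_test)
    return r, r >= threshold


# Hypothetical scores for the same six students on both tests.
new_test_scores = [55, 62, 70, 48, 81, 90]
reference_scores = [58, 60, 75, 50, 78, 93]

r, is_high = concurrent_validity(new_test_scores, reference_scores)
print(f"r = {r:.2f}; high positive correlation: {is_high}")
```

The same students must sit both tests, and the scores must be listed in
the same student order, for the coefficient to be meaningful.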
For example, in a course unit whose objective is for students to be able
to orally produce voiced and unvoiced stops in all possible phonetic
environments, the results of one teacher's unit test might be compared
with an independent assessment such as a commercially produced test of
similar phonemic proficiency. Since criterion-related evidence usually
falls into one of two categories, concurrent and predictive validity, a
classroom test designed to assess mastery of a point of grammar in
communicative use will have criterion validity if test scores are
corroborated either by observed subsequent behaviour or by other
communicative measures of the grammar point in question.
4.5.5 Predictive validity
Predictive validity is closely related to concurrent validity in that it
too generates a numerical value. For example, the predictive validity of
a university language placement test can be determined several semesters
later by correlating the scores on the test with the GPA of the students
who took the test. Therefore, a test with high predictive validity is a
test that would yield predictable results in a later measure. A simple
example of tests that may be concerned with predictive validity is the
trial national examinations conducted at schools in Malaysia, as they are
intended to predict the students' performance on the actual SPM national
examinations (Norleha Ibrahim, 2009).
As mentioned earlier, validity is a complex concept, yet it is crucial to
the teacher's understanding of what makes a good test. It is good to heed
Messick's (1989, p. 36) caution that validity is not an all-or-none
proposition and that various forms of validity may need to be applied to
a test in order to be satisfied with its overall effectiveness.
What are reliability and validity? What determines the reliability of a
test?
What are the different types of validity? Describe any three types and
cite examples.
http://www.2dix.com/pdf-2011/testing-and-evaluation-in-esl-pdf.php
4.5.6 Practicality
Although practicality is an important characteristic of tests, it is
very much a limiting factor in testing. There will be situations in which,
after we have already determined what we consider to be the most valid
test, we need to reconsider the format purely because of practicality
issues. A valid test of spoken interaction, for example, would require
that the examinees be relaxed, interact with peers and speak on topics
that they are familiar and comfortable with. This sounds like the kind of
conversation that people have with their friends while sipping afternoon
tea by the roadside stalls. Of course such a situation would be a highly
valid measure of spoken interaction if we could set it up. Imagine if we
even tried to do so: it would require hidden cameras as well as a lot of
telephone calls and money.
Therefore, a more practical form of the test, especially if it is to be
administered at the national level as a standardised test, is to have a
short interview session of about fifteen minutes using perhaps a picture
or reading stimulus that the examinees would describe or discuss.
Practicality issues, although limiting in a sense, cannot be dismissed if
we are to come up with a useful assessment of language ability.
Practicality issues can involve economics or costs, administration
considerations such as time and scoring procedures, as well as the ease
of interpretation. Tests are only as good as how well they are
interpreted. Therefore, tests that cannot be easily interpreted will
definitely cause many problems.
4.5.7 Objectivity
The objectivity of a test refers to the ability of teachers/examiners
who mark the answer scripts to score consistently, that is, the extent to
which an examiner awards the same score to the same answer script. A test
is said to have high objectivity when the examiner is able to give the
same score to similar answers, guided by the mark scheme. An objective
test is a test that has the highest level of objectivity because the
scoring is not influenced by the examiner's skills and emotions.
Meanwhile, a subjective test is said to have the lowest objectivity.
According to various studies, different examiners tend to award different
scores to an essay test. It is also possible that the same examiner would
give different scores to the same essay if s/he were to re-mark it at
different times.
4.5.8 Washback effect
The term 'washback' or 'backwash' (Hughes, 2003, p. 1) refers to the
impact that tests have on teaching and learning. Such impact is usually
seen as being negative: tests are said to force teachers to do things
they do not necessarily wish to do. However, some have argued that tests
are potentially also 'levers for change' in language education, the
argument being that if a bad test has negative impact, a good test should
or could have positive washback (Alderson, 1986b; Pearson, 1988).
Cheng, Watanabe, and Curtis (2004) devoted an entire anthology to the
issue of washback, while Spratt (2005) challenged teachers to become
agents of beneficial washback in their language classrooms. Brown (2010)
discusses the factors that provide beneficial washback in a test. He
mentions that such a test can positively influence what and how teachers
teach and students learn, offer learners a chance to prepare adequately,
give learners feedback that enhances their language development, be more
formative in nature than summative, and provide conditions for peak
performance by the learners.
In large-scale assessment, washback often refers to the effects that
tests have on instruction in terms of how students prepare for the test.
In classroom-based assessment, washback can have a number of positive
manifestations, ranging from the benefit of preparing and reviewing for a
test to the learning that accrues from feedback on one's performance.
Teachers can provide information that washes back to students in the form
of useful diagnoses of strengths and weaknesses.
The challenge to teachers is to create classroom tests that serve as
learning devices through which washback is achieved. Students' incorrect
responses can become a platform for further improvements. On the other
hand, their correct responses need to be praised, especially when they
represent accomplishments in a student's developing competence. Teachers
can use various strategies in providing guidance or coaching. Washback
enhances a number of basic principles of language acquisition, namely
intrinsic motivation, autonomy, self-confidence, language ego,
interlanguage, and strategic investment, among others.
Washback is generally said to be either positive or negative.
Unfortunately, students and teachers tend to think of the negative
effects of testing, such as test-driven curricula and only studying and
learning what they need to know for the test. Positive washback, or what
we prefer to call 'guided washback', can benefit teachers, students and
administrators. Positive washback assumes that testing and curriculum
design are both based on clear course outcomes, which are known to both
students and teachers/testers. If students perceive that tests are
markers of their progress towards achieving these outcomes, they have a
sense of accomplishment. In short, tests must be part of learning
experiences for all involved. Positive washback occurs when a test
encourages good teaching practice.
Washback is particularly obvious when the tests or examinations in
question are regarded as being vital and as having a definite impact on
the students' or test-takers' future. We would expect, for example, that
national standardised examinations would have stronger washback effects
than a school-based or classroom-based test.
4.5.9 Authenticity
Another major principle of language testing is authenticity. It is a
concept that is difficult to define, particularly within the art and
science of evaluating and designing tests. Brown (2010) cites Bachman and
Palmer (1996), who defined authenticity as 'the degree of correspondence
of the characteristics of a given language test task to the features of a
target language task' (p. 23) and then suggested an agenda for identifying
those target language tasks and for transforming them into valid test
items.
Language learners are motivated to perform when they are faced
with tasks that reflect real world situations and contexts. Good testing
or assessment strives to use formats and tasks that reflect the types of
situation in which students would authentically use the target language.
Whenever possible, teachers should attempt to use authentic materials
in testing language skills.
4.6.0 Interpretability
Test interpretation encompasses all the ways that meaning is
assigned to the scores. Proper interpretation requires knowledge
about the test, which can be obtained by studying its manual and other
materials along with current research literature with respect to its
use; no one should undertake the interpretation of scores on any test
without such study. In any test interpretation, the following
considerations should be taken into account.
A. Consider Reliability: Reliability is important because it is a
prerequisite to validity and because the degree to which a score may
vary due to measurement error is an important factor in its
interpretation.
B. Consider Validity: Proper test interpretation requires knowledge of
the validity evidence available for the intended use of the test. Its
validity for other uses is not relevant. Indeed, use of a measurement
for a purpose for which it was not designed may constitute misuse.
The nature of the validity evidence required for a test depends upon its
use.
C. Scores, Norms, and Related Technical Features: The result of
scoring a test or subtest is usually a number called a raw score, which
by itself is not interpretable. Additional steps are needed to translate
the number either directly into a verbal description (e.g., pass or
fail) or into a derived score (e.g., a standard score). Less than full
understanding of these procedures is likely to produce errors in
interpretation and ultimately in counselling or other uses.
D. Administration and Scoring Variation: Stated criteria for score
interpretation assume standard procedures for administering and
scoring the test. Departures from standard conditions and
procedures modify and often invalidate these criteria.
Study some commercially produced tests and evaluate the authenticity
of these tests/test items.
Discuss the importance of authenticity in testing.
Based on samples of formative and summative assessments, discuss
aspects of reliability/validity that must be considered in these
assessments.
Discuss measures that a teacher can take to ensure high validity of
language assessment for the primary classroom.
TOPIC 5
5.0
DESIGNING CLASSROOM LANGUAGE TEST
SYNOPSIS
Topic 5 exposes you to the stages of test construction, the preparing of
the test blueprint/test specifications, the elements in a Test
Specifications Guidelines, and the importance of following the guidelines
for constructing test items. Then we look at the various test formats
that are appropriate for language assessment.
5.1
LEARNING OUTCOMES
1. identify the different stages of test construction
2. describe the features of a test specification
3. draw up a test specification that reflects both the purpose and the
objectives of the test
4. compare and contrast Bloom's taxonomy and SOLO taxonomy
5. categorise test items according to Bloom's taxonomy
6. discuss the elements of test items of high quality, reliability and
validity
7. identify the elements in a Test Specifications Guidelines
8. demonstrate an understanding of the importance of following the
guidelines for constructing test items
9. illustrate test formats that are appropriate and meet the
requirements of the learning outcomes
5.2
FRAMEWORK OF TOPICS
CONTENT
SESSION FIVE (3 hours)
5.3
Stages of Test Construction

Constructing a test is not an easy task; it requires a variety of skills
along with deep knowledge of the area for which the test is to be
constructed. The steps include:
i. determining
ii. planning
iii. writing
iv. preparing
v. reviewing
vi. pre-testing
vii. validating
5.3.1 Determining
The essential first step in testing is to be perfectly clear about what
one wants to know and for what purpose. When we start to construct a
test, the following questions have to be answered.
Who are the examinees?
What kind of test is to be made?
What is the precise purpose?
What abilities are to be tested?
How detailed and how accurate must the results be?
How important is the backwash effect?
What constraints are set by the unavailability of expertise, facilities,
and time for construction, administration, and scoring?
What is the scope of the test?
5.3.2 Planning
The first form that the solution takes is a set of specifications for
the test. This will include information on content, format and timing,
criteria, levels of performance, and scoring procedures.
In this stage, the test constructor has to determine the content by
carrying out the following tasks:
Describing the purpose of the test;
Describing the characteristics of the test-takers, that is, the nature of
the population of examinees for whom the test is being designed;
Defining the nature of the ability we want to measure;
Developing a plan for evaluating the qualities of test usefulness, which
is the degree to which a test is useful for teachers and students. It
includes six qualities: reliability, validity, authenticity, practicality,
interactiveness, and impact;
Identifying resources and developing a plan for their allocation and
management;
Determining the format and timing of the test;
Determining levels of performance;
Determining scoring procedures.
5.3.3 Writing
Writing items is time-consuming, and writing good items is an art.
No one can expect to be able to produce perfect items consistently.
Some items will have to be rejected, others reworked. The best way to
identify items that have to be improved or abandoned is through
teamwork. Colleagues must really try to find fault; and despite the
seemingly inevitable emotional attachment that item writers develop to
items that they have created, they must be open to, and ready to
accept, the criticisms that are offered to them. Good personal relations
are a desirable quality in any test writing team.
Test item writers should possess the following characteristics:
They have to be experienced in test construction.
They have to be quite knowledgeable about the content of the test.
They should be able to use language clearly and economically.
They have to be ready to sacrifice time and energy.
Another basic aspect of writing the items of the test is sampling.
Sampling means that test constructors choose widely from the whole
area of the course content. It is most unlikely that everything found
under the heading of 'Content' in the specifications can be included in
any one version of the test. Choices have to be made for content
validity and for beneficial backwash. One should not concentrate solely
on elements known to be easy to test. Rather, the content of the test
should be a representative sample of the course material.
5.3.4 Preparing
One has to understand the major principles and techniques of preparing
test items, and experience matters: not every teacher can make a good
tester. To construct different kinds of tests, the tester should observe
some principles. In production-type tests, we have to bear in mind that
no comments are necessary. Test writers should also try to avoid test
items which can be answered through test-wiseness. Test-wiseness refers
to the capacity of the examinees to utilise the characteristics and
formats of the test to guess the correct answer.
5.3.5 Reviewing
Principles for reviewing test items:
The test should not be reviewed immediately after its construction,
but after some considerable time.
Other teachers or testers should review it. In a language test, it is
preferable if native speakers are available to review the test.
5.3.6 Pre-testing
After the test has been reviewed, it should be submitted to pre-testing.
The tester should administer the newly-developed test to a group of
examinees similar to the target group; the purpose is to analyse every
individual item as well as the whole test.
Numerical data (test results) should be collected to check the
efficiency of each item, including item facility and discrimination.
5.3.7 Validating
Item Facility (IF) shows to what extent the item is easy or difficult. The
items should be neither too easy nor too difficult. To measure the
facility or easiness of the item, the following formula is used:
IF = number of correct responses (c) / total number of candidates (N)
And to measure item difficulty:
Item difficulty = number of wrong responses (w) / total number of
candidates (N)
The results of such equations range from 0 to 1. An item with a facility
index of 0 is too difficult, and one with 1 is too easy. The ideal item
is one with a value of 0.5, and the acceptable range for item facility is
between 0.37 and 0.63; i.e. an item below 0.37 is difficult, and one
above 0.63 is easy.
Thus, tests which are too easy or too difficult for a given sample
population often show low reliability. As noted in Topic 4, reliability is
one of the complementary aspects of measurement.
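The item facility calculation above can be sketched as follows. The
function names and the sample responses are hypothetical; the 0.37 and
0.63 cut-offs are the ones given in the text, and w is assumed to stand
for the number of wrong responses.

```python
def item_facility(responses):
    """IF = number of correct responses (c) / total number of candidates (N).

    responses -- one boolean per candidate, True meaning a correct answer
    """
    if not responses:
        raise ValueError("at least one response is needed")
    return sum(1 for r in responses if r) / len(responses)


def item_difficulty(responses):
    """Number of wrong responses (w) / total number of candidates (N)."""
    return 1 - item_facility(responses)


def classify_item(facility, low=0.37, high=0.63):
    """Label an item using the acceptable range for item facility."""
    if facility < low:
        return "difficult"
    if facility > high:
        return "easy"
    return "acceptable"


# Hypothetical pre-test: 20 candidates, 12 of whom answered the item correctly.
responses = [True] * 12 + [False] * 8
facility = item_facility(responses)
print(f"IF = {facility:.2f}, difficulty = {item_difficulty(responses):.2f}, "
      f"item is {classify_item(facility)}")
```

An item answered correctly by 12 of 20 candidates has an IF of 0.60,
which falls inside the acceptable range.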
5.4 Preparing Test Blueprint / Test Specifications

Test specifications (specs) for classroom use can be an outline of your
test, that is, what it will look like (Brown, 2010). Consider your test
specs as a blueprint of the test that includes the following:
a description of its content
item types (methods, such as multiple-choice, cloze, etc.)
tasks (e.g. written essay, reading a short passage, etc.)
skills to be included
how the test will be scored
how it will be reported to students

For classroom purposes (Davidson & Lynch, 2002), the specs
are your guiding plan for designing an instrument that effectively fulfils
your desired principles, especially validity.
It is vital to note that for large-scale standardised tests like the Test
of English as a Foreign Language (TOEFL), the International
English Language Testing System (IELTS), the Michigan English
Language Assessment Battery (MELAB), and the like, which are intended
to be widely distributed and thus are broadly generalised, test
specifications are much more formal and detailed (Spaan, 2006). They
are also usually confidential so that the institution that is designing the
test can ensure the validity of subsequent forms of a test.
Many language teachers claim that it is difficult to construct an item. In
reality, it is rather easy to develop an item if we are committed to
planning the measuring instruments used to evaluate students'
achievement.
However, what exactly is an item for a test? An item is a tool, an
instrument, instruction or question used to get feedback from test-takers,
which serves as evidence of something that is being measured. This
feedback is useful information to consider when measuring or
ascertaining a construct.
Items can be classified as recall items and thinking items. A recall item
requires test-takers to recall information in order to answer, and a
thinking item requires test-takers to use their thinking skills to
attempt it.
For instance, consider a grammar unit test that will be administered at
the end of a three-week grammar course for high-beginning adult
learners (Level 2). The students will be taking a test that covers verb
tenses and two integrated skills (listening/speaking and reading/writing),
and the grammar class they attend serves to reinforce the grammatical
forms that they have learnt in the two earlier classes.
Based on the scenario above, the test specs that you design
might consist of four sequential steps:
1. a broad outline of how the test will be organised
2. which of the eight sub-skills you will test
3. what the various tasks and item types will be
4. how results will be scored, reported to students, and used in future
classes (washback)
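One way to keep such specs explicit is to write them down in a small structure. The following is a minimal sketch only, assuming Python; the class name `BlueprintSpec`, its fields, and all sample values (item counts, scoring and reporting wording) are hypothetical and merely mirror the components of a test spec listed in this section for the grammar unit scenario.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BlueprintSpec:
    """Hypothetical container for a classroom test blueprint."""
    content: str                 # a description of the test content
    item_types: List[str]        # methods, e.g. multiple-choice, cloze
    tasks: List[str]             # e.g. written essay, reading a short passage
    skills: List[str]            # skills to be included
    scoring: str                 # how the test will be scored
    reporting: str               # how it will be reported to students
    items_per_type: Dict[str, int] = field(default_factory=dict)

    def total_items(self) -> int:
        """Total number of items planned across all item types."""
        return sum(self.items_per_type.values())


# Assumed values for the three-week grammar course scenario.
grammar_unit_spec = BlueprintSpec(
    content="Verb tenses covered in the grammar course (Level 2)",
    item_types=["multiple-choice", "cloze"],
    tasks=["reading a short passage", "written essay"],
    skills=["listening/speaking", "reading/writing"],
    scoring="one mark per item (assumed)",
    reporting="score with brief written feedback (assumed)",
    items_per_type={"multiple-choice": 20, "cloze": 10},
)
print(grammar_unit_spec.total_items())
```

Writing the blueprint down in this way makes it easy to check, before any items are drafted, that every planned skill and item type is actually represented.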
Besides knowing the purpose of the test you are creating, you
are required to know as precisely as possible what it is you want to test.
Do not conduct a test hastily. Instead, you need to examine the
objectives for the unit you are testing carefully.
5.5 Bloom's and SOLO Taxonomies

5.5.1 Bloom's Taxonomy (Revised)
Bloom's Taxonomy is a systematic way of describing how a
learner's performance develops from simple to complex levels in the
affective, psychomotor and cognitive domains of learning. The Original
Taxonomy provided carefully developed definitions for each of the six
major categories in the cognitive domain. The categories were
Knowledge, Comprehension, Application, Analysis, Synthesis, and
Evaluation. With the exception of Application, each of these was
broken into subcategories. The complete structure of the original
Taxonomy is shown in Figure 5.1.
Figure 5.1: Original Terms of Bloom's Taxonomy

Retrieved from: http://www.kurwongbss.qld.edu.au/thinking/Bloom/blooms.htm
Adapted from: Pohl, 2000, Learning to Think, Thinking to Learn, p.8
The categories were ordered from simple to complex and from
concrete to abstract. Further, it was assumed that the original
Taxonomy represented a cumulative hierarchy; that is, mastery of each
simpler category was a prerequisite to mastery of the next more complex
one. In the cognitive domain, there are six stages, namely
Knowledge, Comprehension, Application, Analysis, Synthesis and
Evaluation. Unfortunately, traditional education tends to base
student learning in this domain. In the original Taxonomy, the
Knowledge category embodied both noun and verb aspects. The noun
or subject matter aspect was specified in Knowledge's extensive
subcategories. The verb aspect was included in the definition given to
Knowledge in that the student was expected to be able to recall or
recognise knowledge. This brought uni-dimensionality to the framework
at the cost of a Knowledge category that was dual in nature and thus
different from the other Taxonomic categories. In the 1990s, Anderson
(a former student of Bloom) eliminated this inconsistency in the revised
Taxonomy by allowing these two aspects, the noun and the verb, to form
separate dimensions, the noun providing the basis for the Knowledge
dimension and the verb forming the basis for the Cognitive Process
dimension, as shown in Figure 5.2.
Figure 5.2: Bloom's Revised Taxonomy

Retrieved from: http://www.kurwongbss.qld.edu.au/thinking/Bloom/blooms.htm
In the revised Bloom's Taxonomy, the names of the six major
categories were changed from noun to verb forms. As the taxonomy
reflects different forms of thinking, and thinking is an active process,
verbs were used instead of nouns.
Besides, the subcategories of the six major categories were also
replaced by verbs and some subcategories were re-organised. The
knowledge category was renamed. Knowledge is an outcome or
product of thinking, not a form of thinking per se. Consequently, the
word knowledge was inappropriate to describe a category of thinking
and was replaced with the word remembering instead. Comprehension
and synthesis were retitled understanding and creating respectively,
in order to better reflect the nature of the thinking defined in each
category. Table 3 below provides a summary of the above.

Table 3: The Cognitive Process Dimension
Categories & Cognitive Processes | Alternative Names | Definition

Level 1 C1: Remember | | Retrieve knowledge from long-term memory
Recognising | Identifying | Locating knowledge in long-term memory that is consistent with presented material
Recalling | Retrieving | Retrieving relevant knowledge from long-term memory

Level 2 C2: Understand | | Construct meaning from instructional messages, including oral, written, and graphic communication
Interpreting | Clarifying, Paraphrasing, Representing, Translating | Changing from one form of representation to another
Exemplifying | Illustrating, Instantiating | Finding a specific example or illustration of a concept or principle
Classifying | Categorising, Subsuming | Determining that something belongs to a category
Summarising | Abstracting, Generalising | Abstracting a general theme or major point(s)
Inferring | Concluding, Extrapolating, Interpolating, Predicting | Drawing a logical conclusion from presented information
Comparing | Contrasting, Mapping, Matching | Detecting correspondences between two ideas, objects, and the like
Explaining | Constructing models | Constructing a cause-and-effect model of a system

Level 3 C3: Apply | | Carry out or use a procedure in a given situation
Executing | Carrying out | Applying a procedure to a familiar task
Implementing | Using | Applying a procedure to an unfamiliar task

Level 4 C4: Analyse | | Break material into its constituent parts and determine how the parts relate to one another and to an overall structure or purpose
Differentiating | Discriminating, Distinguishing, Focusing, Selecting | Distinguishing relevant from irrelevant parts or important from unimportant parts of presented material
Organising | Finding coherence, Integrating, Outlining, Parsing, Structuring | Determining how elements fit or function within a structure
Attributing | Deconstructing | Determining a point of view, bias, values, or intent underlying presented material

Level 5 C5: Evaluate | | Make judgments based on criteria and standards
Checking | Coordinating, Detecting, Monitoring, Testing | Detecting inconsistencies or fallacies within a process or product; determining whether a process or product has internal consistency; detecting the effectiveness of a procedure as it is being implemented
Critiquing | Judging | Detecting inconsistencies between a product and external criteria; determining whether a product has external consistency; detecting the appropriateness of a procedure for a given problem

Level 6 C6: Create | | Putting elements together to form a coherent or functional whole; reorganise elements into a new pattern or structure
Generating | Hypothesising | Coming up with alternative hypotheses based on criteria
Planning | Designing | Devising a procedure for accomplishing some task
Producing | Constructing | Inventing a product
The Knowledge Domain
Categories | Definition
Factual Knowledge | The basic elements students must know to be acquainted with a discipline or solve problems in it
Conceptual Knowledge | The interrelationships among the basic elements within a larger structure that enable them to function together
Procedural Knowledge | How to do something, methods of inquiry, and criteria for using skills, algorithms, techniques, and methods
Metacognitive Knowledge | Knowledge of cognition in general as well as awareness and knowledge of one's own cognition
5.5.2 SOLO Taxonomy

The SOLO taxonomy, where SOLO stands for the Structure of the
Observed Learning Outcome, is likewise a systematic way of
describing how a learner's performance develops from simple to
complex levels in their learning. Biggs & Collis first introduced it in their
1982 study. There are five stages: Prestructural, Unistructural and
Multistructural, which are in a quantitative phase, and Relational and
Extended Abstract, which are in a qualitative phase.
Students find learning more complex as it advances. SOLO is
a means of classifying learning outcomes in terms of their complexity,
enabling teachers to assess students' work in terms of its quality, not of
how many bits of this and of that they got right. At first we pick up only
one or a few aspects of the task (unistructural), then several aspects but
they are unrelated (multistructural), then we learn how to integrate
them into a whole (relational), and finally, we are able to generalise that
whole to as yet untaught applications (extended abstract). The
diagram below lists verbs typical of each such level.
Figure 5.3: SOLO Taxonomy

The SOLO taxonomy maps the complexity of a student's work by linking
it to one of five phases: little or no understanding (Prestructural), through a
simple and then more developed grasp of the topic (Unistructural and
Multistructural), to the ability to link the ideas and elements of a task together
(Relational) and finally (Extended Abstract) to understand the topic for
themselves, possibly going beyond the initial scope of the task (Biggs & Collis,
1982; Hattie & Brown, 2004). In their later research into multimodal learning,
Biggs & Collis noted that there was an increase in the structural complexity of
their (the students') responses (1991:64).
It may be useful to view the SOLO taxonomy as an integrated strategy,
to be used in lesson design, in task guidance and in formative and summative
assessment (Smith & Colby, 2007; Black & Wiliam, 2009; Hattie, 2009; Smith,
2011). The structure of the taxonomy encourages viewing learning as an
ongoing process, moving from simple recall of facts towards a deeper
understanding; that is, learning is a series of interconnected webs that can be
built upon and extended. Nückles et al. (2009:261) elaborate:
Cognitive strategies such as organization and elaboration are at the
heart of meaningful learning because they enable the learner to
organize learning into a coherent structure and integrate new
information with existing knowledge, thereby enabling deep
understanding and long-term retention.
This would help to develop Smith's (2011:92) self-regulating, self-evaluating
learners who were well motivated by learning.
A range of SOLO-based techniques exists to assist teachers and
students. Use of constructive alignment (Biggs & Tang, 2009) encourages
teachers to be more explicit when creating learning objectives, focusing on
what the student should be able to do and at which level. This is essential for a
student to make progress and allows for the creation of rubrics, for use in class
(Black & Wiliam, 2009; Nückles et al., 2009; Huang, 2012), to make the
process explicit to the student. HOTS (Higher Order Thinking Skills)
maps (Hook & Mills, 2011) can be used in English to scaffold in-depth
discussion, encouraging students to:
Develop interpretations, use research and critical thinking
effectively to develop their own answers, and write essays that
engage with the critical conversation of the field (Linkon, 2005:247,
cited in Allen, 2011).
It may also be helpful in providing a range of techniques for differentiated

learning (Anderson, 2007; Hook & Mills, 2012).
The SOLO taxonomy has a number of proponents. Hook & Mills
(2011:5) refer to it as a model of learning outcomes that helps schools develop
a common understanding. Moseley et al. (2005:306) advocate its use as a
framework for developing the quality of assessment, citing that it is easily
communicable to students. Hattie (2012:54), in his wide-ranging investigation
into effective teaching and visible learning, outlines three levels of
understanding: surface, deep and conceptual. He indicates that:
The most powerful model for understanding these three levels and
integrating them into learning intentions and success criteria is the
SOLO model.
However, the taxonomy is not without critics. Chick (1998:20) believes
that there is potential to misjudge the level of functioning, and Chan et al.
(2002:512) criticise its conceptual ambiguity, stating that the categorisation
is unstable. In these two studies, the SOLO taxonomy was used primarily for
assessing completed work, so use throughout the teaching process may
alleviate these issues.
An additional criticism, in particular when the taxonomy is compared
with that of Bloom (1956), concerns the SOLO taxonomy's structure. Biggs & Collis
(1991) refer to the structure as a hierarchy, as do Moseley et al. (2005);
naturally, there are concerns when complex processes, such as human
thought, are categorised in this manner. However, Campbell et al. (1992)
explained the structure of the SOLO taxonomy as consisting of a series of
cycles (especially between the Unistructural, Multistructural and Relational
levels), which would allow for a development of breadth of knowledge as well
as depth.
Nevertheless, the SOLO taxonomy can be used not only in designing the
curriculum in terms of the intended learning outcomes, but also in assessment.
It can be effectively used for students to deconstruct exam questions, to
understand the marks awarded, and as a vehicle for self-assessment and
peer-assessment.
5.6 Guidelines for constructing test items

Tests do not work without well-written test items. Test-takers appreciate
clearly written questions that do not attempt to trick or confuse them into
incorrect responses. The following presents the major characteristics of
well-written test items.
5.6.1 Aim of the test
Test item development is a critical step in building a test that properly
meets certain standards. A good test is only as good as the quality of the test
items. If the individual test items are not appropriate and do not perform well,
how can the test scores be meaningful? The topic to be evaluated (construct)
and where the evaluation is done (title/context) must be part of the curriculum.
If it is evaluated outside the curriculum, the curricular validity of the item can
be disputed. Therefore, test items must be developed to precisely measure the
objectives prescribed by the blueprint and meet quality standards.
5.6.2 Range of the topics to be tested
A test must measure the test-takers' ability or proficiency in applying the
knowledge and principles on the topics that they have learnt. Ample
opportunity must be given to students to learn the topics that are to be
evaluated. This opportunity would include the availability of language
teachers, well-equipped facilities, and the expertise of the language teachers
in conducting the lessons and providing the skills and knowledge that would be
evaluated to the test-takers or students.
5.6.3 Range of skills to be tested
Test item writers should always attempt to write test items that measure
higher levels of cognitive processing. This is not an easy task. It should be a
goal of writers to ensure their items have cognitive characteristics
exemplifying understanding, problem-solving, critical
thinking, analysis, synthesis, evaluation and interpreting rather than just
declarative knowledge. There are many theories that provide frameworks on
levels of thinking, and Bloom's taxonomy is often cited as a tool to use in item
writing. Always stick to writing important questions that represent, and can
predict, a test-taker's proficiency at high levels of cognitive processing.
5.6.4 Test format
Test items should always follow a consistent design so that the
questioning process in itself does not give unnecessary difficulty to answering
questions. Therefore a logical and consistent stimulus format for writing test
items can help expedite the laborious process of writing test items as well as
supply a format for asking basic questions. A format that provides an initial
starting structure to use in writing questions can be valuable for item writers.
When these formats are used, test takers can quickly read and understand the
questions, since the format is expected. For example, to measure
understanding of knowledge or facts, questions can begin with the following:
What best defines ...?
What is not a characteristic of ...?
What is an example of ...?
5.6.5 Level of difficulty
A test has a planned number of questions at a level of difficulty and
discrimination to best determine mastery and non-mastery performance states.
Test-takers should clearly understand what is needed in education and
language assessment to prepare for the examination and how much
experience performing certain activities would help in preparation. This should
be the road map that helps item writers create test items and helps test-takers
understand what will be required of them to pass an examination. In any test
item construction, we must ensure that weak students can answer easy items,
students of intermediate language proficiency can answer easy and moderate
items, and students of high language proficiency can answer easy,
moderate and advanced test items. A reliable and valid test instrument should
encompass all three levels of difficulty.
5.6.6 International and Cultural Considerations (bias)
In standardised tests, when exams are distributed internationally, either
in a single language or translated into other languages, always refrain from
the use of slang, geographic references, historical references or dates
(holidays) that may not be understood by an international examinee. Tests
need to be adapted to other societies so that meaning is translated correctly
and no particular group of test-takers is given an advantage. Steps should be
taken to avoid item content that may be biased against gender, racial or other
cultural groups.
What are the good characteristics of a test item?

Explain each characteristic of a test item in a graphic organiser.
http://books.google.com.my/books/about/Constructing_Test_Items.html?id=Ia3SGDfbaV
6.0
Test format
What is the difference between test format and test type? For example,
when you want to introduce a new kind of test, such as a reading test, which
is organised a little differently from the existing test items, what do you call it:
test format or test type? Test format refers to the layout of questions on a test.
For example, the format of a test could be two essay questions, 50
multiple-choice questions, etc. For the sake of brevity, I will provide only the
outlines of some large-scale standardised tests.
UPSR
The Primary School Evaluation Test, also known as Ujian Penilaian Sekolah
Rendah (commonly abbreviated as UPSR; Malay), is a national examination
taken by all pupils in our country at the end of their sixth year in primary
school before they leave for secondary school. It is prepared and examined by
the Malaysian Examinations Syndicate. This test consists of two papers,
namely Paper 1 and Paper 2.
In Paper 1, multiple-choice questions are answered on a standardised optical
answer sheet that uses optical mark recognition for detecting answers.
Paper 2 comprises three sections, namely Sections A, B, and C.
TOEFL (Test of English as a Foreign Language)
The TOEFL test is administered in two ways: as an Internet-based test
(TOEFL iBT), and as a paper-based test (TOEFL PBT). Most of the 4,500+
test sites in the world use the TOEFL iBT. The TOEFL iBT test is given in
English and administered via the Internet. There are four sections (listening,
reading, speaking and writing), which take a total of about four and a half
hours to complete.
IELTS Test Format

IELTS is a test of all four language skills: Listening, Reading, Writing and
Speaking. Test-takers will take the Listening, Reading and Writing tests all on
the same day, one after the other, with no breaks in between. Depending on
the examinee's test centre, the Speaking test may be on the same day as
the other three tests, or up to seven days before or after that. The total test
time is under three hours. The test format is illustrated below.
Figure 6: IELTS Test Format
TOPIC 6
6.0
ASSESSING LANGUAGE SKILLS

CONTENT
SYNOPSIS
Topic 6 focuses on ways to assess language skills and language
content. It defines the types of test items used to assess language skills
and language content. It also provides teachers with suggestions on
ways a teacher can assess the listening, speaking, reading and writing
skills in a classroom. It also discusses concepts of and differences
between discrete point test, integrative test and communicative test.
6.1
LEARNING OUTCOMES
At the end of Topic 6, teachers will be able to:
Identify and carry out the different types of assessment to assess
language skills and language content
Understand and differentiate between objective and subjective
testing
Understand and differentiate between discrete point test,
integrative test and communicative test in assessing language.
6.2
FRAMEWORK OF TOPICS
CONTENT
SESSION SIX (6 hours)

6.2.1
Types of test items to assess language skills

a. Listening
Basically, there are two kinds of listening tests: tests that test specific aspects
of listening, like sound discrimination; and task-based tests, which test skills in
accomplishing different types of listening tasks considered important for the
students being tested. In addition to this, Brown (2010) identified four types of
listening performance from which assessment could be considered.
i. Intensive: listening for perception of the components (phonemes, words,
intonation, discourse markers, etc.) of a larger stretch of language.
ii. Responsive: listening to a relatively short stretch of language (a
greeting, question, command, comprehension check, etc.) in order to
make an equally short response.
iii. Selective: processing stretches of discourse such as short monologues
for several minutes in order to scan for certain information. The
purpose of such performance is not necessarily to look for global or
general meaning but to be able to comprehend designated information
in a context of longer stretches of spoken language (such as classroom
directions from a teacher, TV or radio news items, or stories).
Assessment tasks in selective listening could ask students, for example,
to listen for names, numbers, grammatical category, directions (in a map
exercise), or certain facts and events.
iv. Extensive: listening to develop a top-down, global
understanding of spoken language. Extensive performance
ranges from listening to lengthy lectures to listening to a
conversation and deriving a comprehensive message or
purpose. Listening for the gist or the main idea and making
inferences are all part of extensive listening.
b. Speaking
In the assessment of oral production, both discrete-feature
objective tests and integrative task-based tests are used. The first
type tests such skills as pronunciation, knowledge of what
language is appropriate in different situations, and the language required in
doing different things like describing, giving directions, giving
instructions, etc. The second type involves finding out if pupils can
perform different tasks using spoken language that is appropriate
for the purpose and the context. Task-based activities involve
describing scenes shown in a picture, participating in a discussion
about a given topic, narrating a story, etc. As in the listening
performance assessment tasks, Brown (2010) cited five categories
for oral assessment.
1. Imitative. At one end of a continuum of types of speaking
performance is the ability to imitate a word or phrase or possibly
a sentence. Although this is a purely phonetic level of oral
production, a number of prosodic (intonation, rhythm, etc.), lexical,
and grammatical properties of language may be included in the
performance criteria. We are interested only in what is
traditionally labelled pronunciation; no inferences are made
about the test-taker's ability to understand or convey meaning or
to participate in an interactive conversation. The only role of
listening here is in the short-term storage of a prompt, just long
enough to allow the speaker to retain the short stretch of
language that must be imitated.
2. Intensive. The production of short stretches of oral language
designed to demonstrate competence in a narrow band of
grammatical, phrasal, lexical, or phonological relationships.
Examples of intensive assessment tasks include directed
response tasks (requests for specific production of speech),
reading aloud, sentence and dialogue completion, limited
picture-cued tasks including simple sentences, and translation up to the
simple sentence level.

3. Responsive. Responsive assessment tasks include interaction
and test comprehension but at the somewhat limited level of very
short conversations, standard greetings and small talk, simple
requests and comments, etc. The stimulus is almost always a
spoken prompt (to preserve authenticity) with one or two follow-up
questions or retorts:
A. Liza: Excuse me, do you have the time?
Don: Yeah. Six-fifteen.
B. Jo: What is the most urgent social problem today?
Sue: I would say bullying.
C. Lan: Hey, Shan, how's it going?
Shan: Not bad, and yourself?
Lan: I'm good.
Shan: Cool. Okay, gotta go.
4. Interactive. The difference between responsive and interactive
speaking is in the length and complexity of the interaction, which
sometimes includes multiple exchanges and/or multiple
participants. Interaction can be broken down into two types: (a)
transactional language, which has the purpose of exchanging
specific information, and (b) interpersonal exchanges, which have
the purpose of maintaining social relationships. (In the three
dialogues cited above, A and B are transactional, and C is
interpersonal.)
5. Extensive (monologue). Extensive oral production tasks include
speeches, oral presentations, and storytelling, during which the
opportunity for oral interaction from listeners is either highly limited
(perhaps to nonverbal responses) or ruled out altogether. Language
style is more deliberative (planning is involved) and formal for
extensive tasks. It can also include informal monologue such as
casually delivered speech (e.g., recalling a vacation in the
mountains, conveying recipes, recounting the plot of a novel or
movie).
c. Reading
Cohen (1994) discussed various types of reading and the kinds of meaning
assessed. He describes skimming and scanning as two different types
of reading, in which a respondent is given a lengthy passage and is
required to inspect it rapidly (skim) or read to locate specific
information (scan) within a short period of time. He also discusses
receptive reading or intensive reading, which refers to "a form of
reading aimed at discovering exactly what the author seeks to
convey" (p. 218). This is the most common form of reading, especially
in test or assessment conditions. Another type of reading is to read
responsively, where respondents are expected to respond to some
point in a reading text through writing or by answering questions.
A reading text can also convey various kinds of meaning, and reading
involves the interpretation or comprehension of these meanings. First,
grammatical meanings are meanings that are expressed through
linguistic structures such as complex and simple sentences and the
correct interpretation of those structures. A second meaning is
informational meaning, which refers largely to the concepts or messages
contained in the text. Respondents may be required to comprehend
merely the information or content of the passage, and this may be
assessed through various means such as summary and précis writing.
Compared to grammatical or syntactic meaning, informational meaning
requires a more general understanding of a text rather than having to
pay close attention to the linguistic structure of sentences. A third
meaning contained in many texts is discourse meaning. This refers to
the perception of rhetorical functions conveyed by the text. One typical
function is discourse marking, which adds cohesiveness to a text.
Discourse markers, such as unless, however, thus, therefore, etc., are crucial
to the correct interpretation of a text, and students may be assessed on
their ability to understand the discoursal meaning that these words bring to the
passage. Finally, a fourth meaning which may also be an object of
assessment in a reading test is the meaning conveyed by the writer's
tone. The writer's tone, whether it is cynical, sarcastic, sad, etc., is
important in reading comprehension but may be quite difficult to
identify, especially by less proficient learners. Nevertheless, there can
be many situations where the reader is completely wrong in
comprehending a text simply because he has failed to perceive the
correct tone of the author.
d. Writing
Brown (2004) identifies three different genres of writing, namely
academic writing, job-related writing and personal writing, each of
which can be expanded to include many different examples. Fiction,
for example, may be considered personal writing according to
Brown's taxonomy. Brown (2010) identified four categories of written
performance that capture the range of written production and can
be used to assess writing skill.
1.
Imitative. To produce written language, the learner must attain the

skills in the fundamental, basic tasks of writing letters, words,
punctuation, and brief sentences. This category includes the
ability to spell correctly and to perceive phoneme-grapheme
correspondences in the English spelling system. At this stage the
learners are trying to master the mechanics of writing. Form is
the primary focus while context and meaning are of secondary
2.
concern.
Intensive (controlled). Beyond the fundamentals of imitative
writing are skills in producing appropriate vocabulary within a
context, collocation and idioms, and correct grammatical features
up to the length of a sentence. Meaning and context are
important in determining correctness and appropriateness but
most assessment tasks are more concerned with a focus on form
3.
and are rather strictly controlled by the test design.

Responsive. Assessment tasks require learners to perform at a
limited discourse level, connecting sentences into a paragraph
and creating a logically connected sequence of two or three
paragraphs. Tasks relate to pedagogical directives, lists of criteria,
outlines, and other guidelines. Genres of writing include brief
narratives and descriptions, short reports, lab reports, summaries,
brief responses to reading, and interpretations of charts and
graphs. Form-focused attention is mostly at the discourse level,
with a strong emphasis on context and meaning.
4.
Extensive. Extensive writing implies successful management of all
the processes and strategies of writing for all purposes, up to the
length of an essay, a term paper, a major research project report,
or even a thesis. Focus is on achieving a purpose, organizing and
developing ideas logically, using details to support or illustrate
ideas, demonstrating syntactic and lexical variety, and in many
cases, engaging in the process of multiple drafts to achieve a final
product. Focus on grammatical form is limited to occasional
editing and proofreading of a draft.
6.2.2 Objective and Subjective test
Tests have been categorized in many different ways. The most
familiar terms regarding tests are objective and subjective
tests. We normally associate objective tests with multiple choice
question type tests and subjective tests with essays. However, to
be more accurate, we will consider how the test is graded. Objective
tests are tests that are graded objectively while subjective tests are
thought to involve subjectivity in grading.
There are many examples of each type of test. Objective type tests
include the multiple choice test, true false items and matching items
because each of these is graded objectively. In these examples of
objective tests, there is only one correct response and the grader
does not need to subjectively assess the response.
Examples of the subjective test include essays and short answer
questions. However, some other common types of test, such as the
dictation test, fill in the blank type tests, interviews and
role plays, can be considered both subjective and objective. They
fall on a continuum on which some tests are more objective than
others, so each of these tests lies closer to one end of the
continuum or the other.
Two other terms, select type tests and supply type tests, are related
to the objective and subjective distinction. In most
cases, objective tests are similar to select type tests where students
are expected to select or choose the answer from a list of options.
Just as a multiple choice question test is an objective type test, it
can also be considered a select type test. Similarly, tests involving
essay type questions are supply type as the students are expected
to supply the answer through their essay. How then would you
classify a fill in the blank type test? Definitely for this type of test,
the students need to supply the answer, but what is supplied is
merely a single word or a short phrase which differs tremendously
from an essay. It may therefore be helpful to once again consider a
continuum with supply type and select type items at each end of the
continuum respectively.
It is possible to now combine both continua as shown in Figure 6.1
with the two different test formats placed within the two continua:
Figure 6.1: Continua for different types of test formats
It is not by accident that we find there are few, if any, test formats that are
either supply type and objective or select type and subjective. Select type
tests tend to be objective while supply type tests tend to be subjective.
In addition to the above, Brown and Hudson (1998) have also suggested
three broad categories to differentiate tests according to how students are
expected to respond. These categories are the selected response tests, the
constructed response tests, and the personal response tests. Examples of
each of these types of tests are given in Table 6.1.
Table 6.1: Types of Tests According to Students' Expected Response

Selected response    Constructed response    Personal response
True false           Fill-in                 Conferences
Matching             Short answer            Portfolios
Multiple choice      Performance test        Self and peer assessments
Selected response assessments, according to Brown and Hudson (1998),
are assessment procedures in which students typically do not create any
language but rather select the answer from a given list (p. 658).
Constructed response assessment procedures require students to
produce language by writing, speaking, or doing something else (p.
660). Personal response assessments, on the other hand, require
students to produce language but also allow each student's response to
be different from the others and let students communicate what they
want to communicate (p. 663). These three types of tests, categorised
according to how students respond, are useful when we wish to
determine what students need to do when they attempt to answer test
questions.
6.2.3
Types of test items to assess language content

a.
Discrete Point Test and Integrative Test

Language tests may also be categorised as either discrete point or
integrative. Discrete point tests examine one element at a time.
An integrative test, on the other hand, "requires the candidate to
combine many language elements in the completion of a
task" (Hughes, 1989: 16). It is a simultaneous measure of
knowledge and ability across a variety of language features, modes, or
skills.
A multiple choice type test is usually cited as an example of a
discrete point test while essays are commonly regarded as the
epitome of integrative tests. However, both the discrete point test
and the integrative test are a matter of degree. A test may be more
discrete point than another and similarly a test may be more
integrative than another. Perhaps the more important aspect is to be
aware of the discrete point or integrative nature of a test as we must
be careful of what we believe the test measures.
This brings us to the question of how discrete point a multiple
choice question type item really is. While it is definitely more discrete point
than an essay, it may still require more than just one skill or ability in
order to complete. Let's say you are interested in testing a student's
knowledge of the relative pronoun and decide to do so by using a
multiple choice test item. If he fails to answer this test item correctly,
would you conclude that the student has problems with the relative
pronoun? The answer may not be as straightforward as it seems.
The test is presented in textual form and therefore requires the
student to read. As such, even the multiple choice test item involves
some integration of language skills as this example shows, where in
addition to the grammatical knowledge of relative pronouns, the
student must also be able to read and understand the question.
Perhaps a clearer way of viewing the distinction between the
discrete point and the integrative test is to examine the perspective
each takes toward language. In the discrete point test, language is
seen to be made up of smaller units and it may be possible to test
language by testing each unit at a time. Testing knowledge of the
relative pronoun, for example, is certainly assessing the students on
a particular unit of language and not on the language as a whole. In
an integrative test, on the other hand, the perspective of language is
that of an integrated whole which cannot be broken up into smaller
units or elements. Hence, the testing of language should maintain
the integrity or wholeness of the language.
b.
Communicative Test
As language teaching has emphasised the importance of
communication through the communicative approach, it is not surprising
that communicative tests have also been given prominence. A
communicative emphasis in testing involves many aspects, two of
which revolve around communicative elements in tests and meaningful
content. Both these aspects are briefly addressed in the following
subsections:
Integrating Communicative Elements into Examinations

Alderson and Banerjee (2002) report on various studies that seem to
point to the difficulty in achieving authenticity in tests. They cite
Spence-Brown (2001), who posits that "the very act of assessment changes the
nature of a potentially authentic task and compromises authenticity" and
that "authenticity must be related to the implementation of an activity,
not to its design" (p. 99). In her study, students were required to
interview native speakers outside the classroom and submit a
tape-recording of the interview. While this activity seems quite authentic, the
students were observed to prepare for the interview by "rehearsing the
interview, editing the results, and engaging in spontaneous, but flawed
discourse" (Alderson & Banerjee, 2002: 99), all of which are inauthentic
when viewed in terms of real life situations. Alderson himself argues
that because candidates in language tests are interested not in
communicating but in displaying their language abilities, the test situation
is a communicative event in itself and therefore cannot be used to
replicate any real world event (p. 98).
Chalhoub-Deville (2003) argues for tests that take context into
consideration. She believes that there should be a "shift in focus of our
measurement from traditional examinations of the construct in terms of
response consistency, to investigations that systematically explore
inconsistent (which does not mean random) performances across
contexts" (p. 378). In the future, besides context, tests will also need to
integrate elements of communication such as topic initiation, topic
maintenance, and topic change in order for the test to become more
authentic and realistic. Due to issues of practicality, especially
the amount of time and extent of organisation needed to allow for such
communicative elements to emerge, this will not be an easy task to
achieve.
The idea of bringing communicative elements into the language test is
not a new one. In his review of communicative tests, Fulcher (2000)
notes the descriptors of a communicative test as suggested by several
theorists. The three principles of communicative tests that he highlights
are that communicative tests:
involve performance;
are authentic; and
are scored on real-life outcomes.
In short, the kinds of tests that we should expect more of in the future
will be communicative tests in which candidates actually have to
produce the language in an interactive setting involving some degree of
unpredictability which is typical of any language interaction situation.
These tests would also take the communicative purpose of the
interaction into consideration and require the student to interact with
language that is actual and unsimplified for the learner. Fulcher finally
points out that in a communicative test, the only real criterion of
success is the behavioural outcome, or whether the learner was able
to achieve the intended communicative effect (p. 493). It is obvious
from this description that the communicative test may not be so easily
developed and implemented. Practical reasons may hinder some of the
demands listed. Nevertheless, a solution to this problem has to be
found in the near future in order to have valid language tests that are
purposeful and can stimulate positive washback in teaching and
learning.
Exercise 1
1.
In your opinion and based on your teaching
experience, how would you conduct the testing of
reading, writing and speaking skills of your own
students? What are the methods that you employ?
Share this with your classmates and exchange ideas.
2.
Describe three different types of writing
performance as suggested by Brown (2004)
and relate them to academic writing,
job related writing and personal writing.
TOPIC 7
7.0
SCORING, GRADING AND
ASSESSMENT CRITERIA
SYNOPSIS
Topic 7 focuses on scoring, grading and assessment criteria. It provides
teachers with brief descriptions of the different approaches to scoring,
namely objective, holistic and analytic.
7.1
LEARNING OUTCOMES
By the end of Topic 7, teachers will be able to:

Identify and differentiate the different approaches used in scoring; and
Use the different scoring approaches in assessing language.
7.2
FRAMEWORK OF TOPICS
CONTENT
SESSION SEVEN (3 hours)
7.2.1
Objective approach
One approach to scoring is the objective scoring approach. This scoring
approach relies on quantified methods of evaluating students' writing. A
sample of how objective scoring is conducted is given by Bailey (1999) as
follows:
Establish standardization by limiting the length of the assessment: Count
the first 250 words of the essay.
Identify the elements to be assessed: Go through the essay up to the 250th
word underlining every mistake from spelling and mechanics through
verb tenses, morphology, vocabulary, etc. Include every error that a literate
reader might note.
Operationalise the assessment: Assign a weight score to each error, from 3
to 1. A score of 3 is a severe distortion of readability or flow of ideas; 2 is a
moderate distortion; and 1 is a minor error that does not affect readability in
any significant way.
Quantify the assessment: Calculate the essay Correctness Score by using
250 words as the numerator of a fraction, and the sum of all the error
scores as the denominator:
\[ \text{Correctness Score} = \frac{250}{\text{sum of error scores}} \]
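The four steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `correctness_score` and the sample list of error weights are hypothetical and do not come from Bailey, and the sketch assumes that an essay with no marked errors is handled separately because the fraction is then undefined.

```python
def correctness_score(error_weights, word_limit=250):
    """Divide the word limit by the sum of the weighted error scores."""
    for weight in error_weights:
        # Each error carries a weight of 3 (severe), 2 (moderate) or 1 (minor).
        if weight not in (1, 2, 3):
            raise ValueError("each error weight must be 1, 2 or 3")
    total = sum(error_weights)
    if total == 0:
        # Assumption: with no errors the fraction has a zero denominator.
        raise ValueError("no errors marked; the score is undefined")
    return word_limit / total


# Hypothetical essay: two severe, three moderate and four minor errors.
sample_errors = [3, 3, 2, 2, 2, 1, 1, 1, 1]
sample_score = correctness_score(sample_errors)
print(sample_score)
```

With this formula, a higher score means fewer or less serious errors in the first 250 words.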
7.2.2 Holistic approach

In holistic scoring, the reader reacts to each student's composition as a
whole and a single score is awarded to the writing. Normally this score is
on a scale of 1 to 4, or 1 to 6, or even 1 to 10 (Bailey, 1998: 187). Each
score on the scale will be accompanied by general descriptors of ability.
The following is an example of a holistic scoring scheme based on a 6
point scale.
Table 7.1: Holistic Scoring Scheme

Source: S.S. Moya, Evaluation Assistance Center (EAC)-East, Georgetown
University, Washington
Rating    Criteria

5-6       Vocabulary is precise, varied, and vivid.
          Organization is appropriate to writing assignment and
          contains clear introduction, development of ideas, and
          conclusion.
          Transition from one idea to another is smooth and provides
          reader with clear understanding that topic is changing.
          Meaning is conveyed effectively.
          A few mechanical errors may be present but do not disrupt
          communication.
          Shows a clear understanding of writing and topic development.

          Vocabulary is adequate for grade level.
          Events are organized logically, but some part of the sample
          may not be fully developed.
          Some transition of ideas is evident.
          Meaning is conveyed but breaks down at times.
          Mechanical errors are present but do not disrupt
          communication.
          Shows a good understanding of writing and topic development.

          Vocabulary is simple.
          Organization may be extremely simple or there may be evidence
          of disorganization.
          There are a few transitional markers or repetitive
          transitional markers.
          Meaning is frequently not clear.
          Mechanical errors affect communication.
          Shows some understanding of writing and topic development.

          Vocabulary is limited and repetitious.
          Sample is comprised of only a few disjointed sentences.
          No transitional markers.
          Meaning is unclear.
          Mechanical errors cause serious disruption in communication.
          Shows little evidence of discourse understanding.

          Responds with a few isolated words.
          No complete sentences are written.
          No evidence of concepts of writing.

          No response.
The 6 point scale above includes broad descriptors of what a student's essay
reflects for each band. It is quite apparent that graders using this scale are
expected to pay attention to vocabulary, meaning, organisation, topic
development and communication. Mechanics such as punctuation are
secondary to communication.
Bailey also describes another type of scoring related to the holistic approach
which she refers to as primary trait scoring. In primary trait scoring, a particular
functional focus is selected which is based on the purpose of the writing and
grading is based on how well the student is able to express that function. For
example, if the function is to persuade, scoring would be on how well the
author has been able to persuade the grader rather than how well organised
the ideas were, or how grammatical the structures in the essay were. This
grading technique emphasises functional and communicative ability rather
than discrete linguistic ability and accuracy.
7.2.3 Analytic approach
Analytical scoring is a familiar approach to many teachers. In analytical
scoring, raters assess students' performance on a variety of categories which
are hypothesised to make up the skill of writing. Content, for example, is
often seen as an important aspect of writing, i.e. is there substance to what
is written? Is the essay meaningful? Similarly, we may also want to consider
the organisation of the essay. Does the writer begin the essay with an
appropriate topic sentence? Are there good transitions between paragraphs?
Other categories that we may also want to consider include vocabulary,
language use and mechanics.
The following are some possible components used in assessing writing
ability using an analytical scoring approach and the suggested weightage
assigned to each:
Components       Weight
Content          30 points
Organisation     20 points
Vocabulary       20 points
Language Use     25 points
Mechanics        5 points
The points assigned to each component reflect the importance of
each of the components.
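As a rough illustration of how the weightage works in practice, the sketch below totals the points a rater awards for each component. It assumes that each component is marked out of its own maximum and that the marks are simply added; the function name `analytic_total` and the sample marks are hypothetical.

```python
# Maximum points per component, taken from the weightage listed above.
WEIGHTS = {
    "Content": 30,
    "Organisation": 20,
    "Vocabulary": 20,
    "Language Use": 25,
    "Mechanics": 5,
}


def analytic_total(points_awarded, weights=WEIGHTS):
    """Add up the points awarded for each component of the essay."""
    total = 0
    for component, maximum in weights.items():
        awarded = points_awarded[component]
        # A rater cannot award more than the component's weight.
        if not 0 <= awarded <= maximum:
            raise ValueError(f"{component} must be between 0 and {maximum}")
        total += awarded
    return total


# Hypothetical marks for one essay.
sample_marks = {
    "Content": 24,
    "Organisation": 15,
    "Vocabulary": 14,
    "Language Use": 18,
    "Mechanics": 4,
}
sample_total = analytic_total(sample_marks)
print(sample_total)
```

Because the five weights add up to 100, the total can be read directly as a percentage, and the separate component marks remain available as diagnostic feedback.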
Comparing the Three Approaches
Each of the three scoring approaches has its own advantages
and disadvantages. These are summarised in Table 7.2.
Table 7.2: Comparison of the Advantages and Disadvantages of the
Three Approaches to Scoring Essays
Scoring approach: Holistic
Advantages:
Quickly graded.
Provides a public standard that is understood by the teachers and
students alike.
Relatively higher degree of rater reliability.
Applicable to the assessment of many different topics.
Emphasises the students' strengths rather than their weaknesses.
Disadvantages:
The single score may actually mask differences across individual
compositions.
Does not provide a lot of diagnostic feedback.

Scoring approach: Analytical
Advantages:
Provides clear guidelines in grading in the form of the various
components.
Allows the graders to consciously address important aspects of
writing.
Disadvantages:
Writing ability is unnaturally split up into components.

Scoring approach: Objective
Advantages:
Emphasises the students' strengths rather than their weaknesses.
Disadvantages:
Still some degree of subjectivity involved.
Accentuates negative aspects of the learner's writing without giving
credit for what they can do well.
EXERCISE
1.
Based on your understanding, draw a mind map to indicate the
advantages and disadvantages of the three approaches to
scoring essays.
TOPIC 8
8.0
ITEM ANALYSIS AND INTERPRETATION
SYNOPSIS
Topic 8 focuses on item analysis and interpretation. It provides teachers with
brief descriptions of basic statistical terms such as mode, median, mean,
standard deviation and standard score, and of the interpretation of data. It
will also look at item analysis, which deals with item difficulty and item
discrimination.
Teachers will also be introduced to distractor analysis in language assessment.
8.1 LEARNING OUTCOMES

Identify and differentiate some basic statistical terms;
Determine how well items discriminate using item discrimination; and
Analyse how well a distractor in a test item performs.
8.2
FRAMEWORK OF TOPICS
CONTENT
SESSION EIGHT (6 hours)
8.2.1 Basic Statistics

Let us assume that you have just graded the test papers for your class. You
now have a set of scores. If a person were to ask you about the performance
of the students in your class, it would be very difficult to give all the scores in
the class. Instead, you may prefer to cite only one score.
Or perhaps you would like to report on the performance by giving some
values that would help provide a good indication of how the students in your
class performed. What values would you give? In this section, we will look at
two kinds of measures, namely measures of central tendency and measures
of dispersion. Both these types of measures are useful in score reporting.
Central tendency measures the extent to which a set of scores gathers
around a central value. There are three major measures of central tendency.
They are the mode, median and mean.
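All three measures can be checked quickly with Python's standard `statistics` module. The sketch below uses the score sets that this section works through by hand; the variable names are only illustrative.

```python
import statistics

# Score sets used in the worked examples of this section.
scores_for_mode = [15, 13, 12, 12, 13, 16, 13, 17, 14, 18]
seven_scores = [47, 65, 45, 54, 50, 52, 51]
eight_scores = [45, 47, 50, 51, 52, 53, 54, 65]

# Mode: the most frequently occurring score.
mode_value = statistics.mode(scores_for_mode)
# multimode returns every mode, which helps when a set is bimodal.
all_modes = statistics.multimode(scores_for_mode)

# Median: the middle score; the function sorts the scores itself.
median_odd = statistics.median(seven_scores)
median_even = statistics.median(eight_scores)

# Mean: the sum of the scores divided by the number of scores.
mean_value = statistics.mean(seven_scores)

print(mode_value, all_modes, median_odd, median_even, mean_value)
```

For a set with an even number of scores, `statistics.median` averages the two middle scores, which is the same procedure described for the eight-score set.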
MODE
The mode is the most frequently occurring raw score in a set of
scores. The following is a set of scores:
15, 13, 12, 12, 13, 16, 13, 17, 14, 18
What is the mode for this set of scores? If you said 13, then
you are correct as it occurs more often than the others. It is
possible to have more than one mode in a set of scores. If there are
two modes, then the set of scores is referred to as being
bimodal.
MEDIAN
The median refers to the score that is in the middle of the
set of scores when the scores are arranged in ascending or
descending order. Take the following set of seven scores:
47, 65, 45, 54, 50, 52, 51
If we arrange it in order based on value, it
would be 45, 47, 50, 51, 52, 54, 65. In this set of scores, the
median will be 51 as it is the middle score. There are three
scores lower than it and an equal number of scores higher
than it.
What happens when there is an even number of scores?
Let's take the following set of scores as an example:
45, 47, 50, 51, 52, 53, 54, 65
As there is no one score that is in the middle, we need to
take the two in the middle, add them up and divide by two.
As such, the median is 51.5 as (51 + 52)/2 or 103/2 = 51.5.
Always remember, however, that when we wish to find the
median, we have to first arrange the scores in either
ascending or descending order of value.
MEAN
The mean of a set of test scores is the arithmetic mean or
average and is calculated as ΣX/N where Σ (sigma) refers
to the sum of, X refers to the raw or observed scores, and N
is the number of observed scores. Look at the following set
of scores:
47, 65, 45, 54, 50, 52, 51
The mean for this set of scores is 364/7 = 52.
8.2.2 Standard deviation
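Standard deviation refers to how much the scores deviate from the mean.
There are two methods of calculating standard deviation, the
deviation method and the raw score method, which are illustrated by the
following formulae, where \bar{X} is the mean and N is the number of scores.
\[ SD = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}} \quad \text{(deviation method)} \]
\[ SD = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N - 1}} \quad \text{(raw score method)} \]
To illustrate this, we will use the scores 20, 25 and 30. Using the deviation
method, we come up with the following table: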
Table 8.1:Calculating the Standard Deviation Using the Deviation Method
Using the raw score method, we can come up with the following:
Table 8.2 : Calculating the Standard Deviation Using the Raw Score Method
Both methods result in the same final value of 5. If you are calculating
standard deviation with a calculator, it is suggested that the deviation
method be used when there are only a few scores and the raw score
method be used when there are many scores. This is because when
there are many scores, it will be tedious to calculate the square of the
deviations and their sum.
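The two methods can be compared with a short Python sketch. It assumes the sample form of the standard deviation, with N - 1 in the denominator, because that is the form that gives the value of 5 for the scores 20, 25 and 30. The function names are hypothetical.

```python
import math


def sd_deviation_method(scores):
    """Square each deviation from the mean, sum them, divide by N - 1."""
    n = len(scores)
    mean = sum(scores) / n
    squared_deviations = [(x - mean) ** 2 for x in scores]
    return math.sqrt(sum(squared_deviations) / (n - 1))


def sd_raw_score_method(scores):
    """Use only the sum of the scores and the sum of the squared scores."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x_squared = sum(x ** 2 for x in scores)
    return math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))


example_scores = [20, 25, 30]
sd_by_deviation = sd_deviation_method(example_scores)
sd_by_raw_score = sd_raw_score_method(example_scores)
print(sd_by_deviation, sd_by_raw_score)
```

The raw score method never needs the individual deviations, which is why it is less tedious for a long list of scores.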
8.2.3 Standard score
Standardised scores are necessary when we want to make
comparisons across tests and measurements. Z scores and T scores
are the more common forms of standardised scores although you
may come up with your own standardised score. A standardised score
can be computed for every raw score in a set of scores for a test.
i. The Z score
The Z score is the basic standardised score. It is referred to as the
basic form because other standardised scores are computed from the
Z score. The formula used to calculate the Z score is as follows,
where X is the raw score, \bar{X} is the mean and SD is the standard
deviation:
\[ Z = \frac{X - \bar{X}}{SD} \]
Table 8.3: Calculating the Z Score for a Set of Scores
Z score values are very small and usually range only from -2 to 2.
Such small values make them inappropriate for score reporting, especially
for those unaccustomed to the concept. Imagine what a parent may
say if his child comes home with a report card with a Z score of 0.47 in
English Language! Fortunately, there is another form of standardised
score, the T score, with values that are more palatable to the
relevant parties.
ii. The T score
The T score is a standardised score which can be computed using the
formula T = 10(Z) + 50. As such, the T scores for students A, B, C, and D in
Table 8.3 are 10(-1.28) + 50; 10(-0.23) + 50; 10(0.47) + 50; and
10(1.04) + 50, or 37.2, 47.7, 54.7, and 60.4 respectively. These values
are far easier to report than the Z scores. The T score
average or mean is always 50 (i.e. a Z score of 0), which
connotes an average ability and the mid point of a 100 point scale.
8.2.4 Interpretation of data
The standardised score is actually a very important score if we want to
compare performance across tests and between students. Let us take the
following scenario as an example:
How can En. Abu solve this problem? He would have to have
standardised scores in order to decide. This would require the following
information:
Test 1: mean = 42, standard deviation = 7
Test 2: mean = 47, standard deviation = 8
Using the information above, En. Abu can find the Z score for each raw
score reported as follows:
Table 8.4: Z Score for Form 2A
Based on Table 8.4, both Ali and Chong have a negative Z score as
their total score for both tests. However, Chong has a higher Z score
total (i.e. -1.07 compared to -1.34) and therefore performed better
when we take the performance of all the other students into
consideration.
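The Z score and T score calculations used in this kind of comparison can be sketched as follows. The Z values for students A to D and the means and standard deviations of Test 1 and Test 2 come from the text; the raw scores 49 and 43 are hypothetical and are used only to show how scores from two different tests are placed on the same scale.

```python
def z_score(raw, mean, sd):
    """Express a raw score as a number of standard deviations from the mean."""
    return (raw - mean) / sd


def t_score(z):
    """Rescale a Z score so that the mean is 50 and one SD is 10 points."""
    return 10 * z + 50


# Z scores quoted in the text for students A, B, C and D.
z_values = {"A": -1.28, "B": -0.23, "C": 0.47, "D": 1.04}
t_values = {student: round(t_score(z), 1) for student, z in z_values.items()}
print(t_values)

# Hypothetical raw scores for one student on the two tests.
z_test_1 = z_score(49, mean=42, sd=7)  # one SD above the Test 1 mean
z_test_2 = z_score(43, mean=47, sd=8)  # half an SD below the Test 2 mean
z_total = z_test_1 + z_test_2
print(z_test_1, z_test_2, z_total)
```

Adding the Z scores, rather than the raw scores, gives each test equal weight regardless of how difficult it was or how widely its scores were spread.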
THE NORMAL CURVE

The normal curve is a hypothetical curve that is supposed to represent all
naturally occurring phenomena. It is assumed that if we were to sample a
particular characteristic such as the height of Malaysian men, then we will
find that while most will have an average height of perhaps 5 feet 4 inches,
there will be a few who will be relatively shorter and an equal number who
are relatively taller. By plotting the heights of all Malaysian men according to
frequency of occurrence, it is expected that we would obtain something
similar to a normal distribution curve. Similarly, test scores that measure any
characteristic such as intelligence, language proficiency or writing ability of a
specific population are also expected to provide us with a normal curve.
The following diagram illustrates what the normal curve looks like.
Figure 8.1: The normal distribution or Bell curve
The normal curve in Figure 8.1 is partitioned according to standard
deviations (i.e. -4s, -3s, +3s, +4s) which are indicated on the horizontal
axis. The area of the curve between standard deviations is indicated in
percentage on the diagram. For example, the area between the mean (0
standard deviation) and +1 standard deviation is 34.13%. Similarly, the
area between the mean and -1 standard deviation is also 34.13%. As
such, the area between -1 and +1 standard deviations is 68.26%.
In using the normal curve, it is important to make a distinction between
standard deviation values and standard deviation scores. A standard
deviation value is a constant and is shown on the horizontal axis of the
diagram above. The standard deviation score, on the other hand, is the
obtained score when we use the standard deviation formula provided
earlier. So, if we find the score to be 5 as in the earlier example, then the
score for the standard deviation value of 1 is 5, for the value of 2 is 5
x 2 = 10, for the value of 3 is 15, and so on. Standard deviation values
of -1, -2, and -3 will have corresponding negative scores of -5, -10, and
-15.
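The percentages on the normal curve and the conversion from standard deviation values to standard deviation scores can be checked with the standard library's `NormalDist` class. This is a sketch only; the variable names are illustrative, and the standard deviation score of 5 is the one obtained in the earlier example.

```python
from statistics import NormalDist

# The standard normal curve has a mean of 0 and a standard deviation of 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Area under the curve between the mean and +1 standard deviation.
area_mean_to_plus_1 = standard_normal.cdf(1) - standard_normal.cdf(0)
# Area under the curve between -1 and +1 standard deviations.
area_minus_1_to_plus_1 = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(round(area_mean_to_plus_1 * 100, 2))
print(round(area_minus_1_to_plus_1 * 100, 2))

# Standard deviation score from the earlier example (scores 20, 25, 30).
sd_score = 5
# Multiply each standard deviation value by the standard deviation score.
scores_at_sd_values = {value: value * sd_score for value in (-3, -2, -1, 1, 2, 3)}
print(scores_at_sd_values)
```

Computed directly, the area between -1 and +1 standard deviations rounds to 68.27%; the figure of 68.26% comes from doubling the rounded value of 34.13%.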
8.2.5
Item analysis
a.
Item difficulty
Item difficulty refers to how easy or difficult an item is. The formula
used to measure item difficulty is quite straightforward. It involves
finding out how many students answered an item correctly and
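dividing it by the number of students who took this test. The formula is
therefore:
\[ \text{Item difficulty} = \frac{\text{number of students who answered the item correctly}}{\text{number of students who took the test}} \]
For example, if twenty students took a test and 15 of them correctly
answered item 1, then the item difficulty for item 1 is 15/20 or 0.75.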
Item difficulty is always reported in decimal points and can range from
0 to 1. An item difficulty of 0 refers to an extremely difficult item with no
students getting the item correct and an item difficulty of 1 refers to an
easy item which all students answered correctly.
The appropriate difficulty level will depend on the purpose of the test.
According to Anastasi & Urbina (1997), if the test is to assess mastery,
then items with a difficulty level of 0.8 can be accepted. However, they
go on to describe that if the purpose of the test is for selection, then we
should utilise items whose difficulty values come closest to the desired
selection ratio. For example, if we want to select 20%, then we should
choose items with a difficulty index of 0.20.
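The item difficulty calculation is simple enough to express as a one-line function. The sketch below follows the definition given in this section; the function name `item_difficulty` is hypothetical.

```python
def item_difficulty(num_correct, num_test_takers):
    """Proportion of test takers who answered the item correctly."""
    if num_test_takers <= 0:
        raise ValueError("at least one student must have taken the test")
    if not 0 <= num_correct <= num_test_takers:
        raise ValueError("num_correct must be between 0 and num_test_takers")
    return num_correct / num_test_takers


# Worked example from the text: 15 of 20 students answered item 1 correctly.
difficulty_item_1 = item_difficulty(15, 20)
print(difficulty_item_1)
```

Note that a higher value means an easier item, since the index is the proportion of students who got the item right.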
b. Item discrimination
Item discrimination is used to determine how well an item is able to
discriminate between good and poor students. Item discrimination values
range from -1 to 1. A value of -1 means that the item discriminates
perfectly, but in the wrong direction. This value would tell us that the
weaker students performed better on an item than the better students.
This is hardly what we want from an item and if we obtain such a value,
it may indicate that there is something not quite right with the item. It is
strongly recommended that we examine the item to see whether it is
ambiguous or poorly written. A discrimination value of 1 shows positive
discrimination with the better students performing much better than the
weaker ones, as is to be expected.
Let's use the following instance as an example. Suppose you have just
conducted a twenty item test and obtained the following results:
Table 8.5: Item Discrimination
As there are twelve students in the class, 33% of this total would be 4
students. Therefore, the upper group and lower group will each consist
of 4 students. Based on their total scores, the upper group would
consist of students L, A, E, and G while the lower group would consist of
students J, H, D and I.
We now need to look at the performance of these students for each item
in order to find the item discrimination index of each item.
For item 1, all four students in the upper group (L, A, E, and G)
answered correctly while only student H in the lower group answered
correctly. Using the formula described earlier, we can plug in the
numbers as follows:
Two points should be noted. First, item discrimination is especially
important in norm referenced testing and interpretation, as in such
instances there is a need to discriminate between good students who
do well in the measure and weaker students who perform poorly. In
criterion referenced tests, item discrimination does not have as
important a role. Secondly, the use of 33.3% of the total number of
students who took the test in the formula is not inflexible, as it is possible
to use any percentage between 27.5% and 35% as the value.
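The item 1 example can be worked through in Python. The sketch assumes the commonly used upper-lower form of the discrimination index, in which the number of correct answers in the lower group is subtracted from the number in the upper group and the difference is divided by the size of one group. The function names are hypothetical.

```python
def group_size_for(class_size, proportion=1 / 3):
    """Number of students in each of the upper and lower groups."""
    return round(class_size * proportion)


def item_discrimination(upper_correct, lower_correct, group_size):
    """Assumed upper-lower index: (upper - lower) / size of one group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (upper_correct - lower_correct) / group_size


# Twelve students in the class, so each group has four students.
group_size = group_size_for(12)

# Item 1: all four upper-group students and one lower-group student
# (student H) answered correctly.
discrimination_item_1 = item_discrimination(4, 1, group_size)
print(group_size, discrimination_item_1)
```

Under this assumption the index is 1 when only the upper group answers correctly and -1 when only the lower group does, which matches the range described above.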
c.
Distractor analysis
Distractor analysis is an extension of item analysis, using techniques
that are similar to item difficulty and item discrimination. In distractor
analysis, however, we are no longer interested in how test takers select
the correct answer, but how the distractors were able to function
effectively by drawing the test takers away from the correct answer. The
number of times each distractor is selected is noted in order to
determine the effectiveness of the distractor. We would expect that the
distractor is selected by enough candidates for it to be a viable
distractor.
What exactly is an acceptable value? This depends to a large extent on
the difficulty of the item itself and what we consider to be an acceptable
item difficulty value for test items. If we are to assume that 0.7 is an
appropriate item difficulty value, then we should expect that the
remaining 0.3 be about evenly distributed among the distractors.
Let us take the following test item as an example:

In the story, he was unhappy because_____________________________
A. it rained all day
B. he was scolded
C. he hurt himself
D. the weather was hot
Let us assume that 100 students took the test. If we assume that A is the
answer and the item difficulty is 0.7, then 70 students answered correctly.
What about the remaining 30 students and the effectiveness of the three
distractors? If all 30 selected D, then distractors B and C are useless in their
role as distractors. Similarly, if 15 students selected D and another 15
selected B, then C is not an effective distractor and should be replaced.
Therefore, the ideal situation would be for each of the three distractors to be
selected by an equal number of all students who did not get the answer
correct, i.e. in this case 10 students. Therefore the effectiveness of each
distractor can be quantified as 10/100 or 0.1 where 10 is the number of
students who selected the distractor and 100 is the total number of students
who took the test. This technique is similar to a difficulty index although the
result does not indicate the difficulty of each item, but rather the
effectiveness of the distractor. In the first situation described in this
paragraph, options A, B, C and D would have a difficulty index of 0.7, 0, 0,
and 0.3 respectively. If the distractors worked equally well, then the indices
would be 0.7, 0.1, 0.1, and 0.1. Unlike in determining the difficulty of an
item, the value of the difficulty index formula for the distractors must be
interpreted in relation to the indices for the other distractors.
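The proportions discussed above can be checked with a short Python sketch. This is only an illustration and not part of the module: the function name `option_proportions` and the two response lists are hypothetical, built to reproduce the 100-student example in which A is the key.

```python
from collections import Counter

def option_proportions(responses, options="ABCD"):
    """Return the proportion of test takers who chose each option."""
    counts = Counter(responses)
    total = len(responses)
    return {opt: counts.get(opt, 0) / total for opt in options}

# Hypothetical data: 100 students, key is A, item difficulty 0.7.
# First situation: all 30 incorrect responses fall on D.
skewed = ["A"] * 70 + ["D"] * 30
# Ideal situation: the 30 incorrect responses are spread evenly.
even = ["A"] * 70 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10

print(option_proportions(skewed))  # {'A': 0.7, 'B': 0.0, 'C': 0.0, 'D': 0.3}
print(option_proportions(even))    # {'A': 0.7, 'B': 0.1, 'C': 0.1, 'D': 0.1}
```

A distractor whose proportion is at or near zero, such as B and C in the first list, is a candidate for replacement.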
From a different perspective, the item discrimination formula can also be
used in distractor analysis. The concept of upper groups and lower groups
would still remain, but the analysis and expectation would differ slightly from
the regular item discrimination that we have looked at earlier. Instead of
expecting a positive value, we should logically expect a negative value as
more students from the lower group should select distractors. Each
distractor can have its own item discrimination value in order to analyse how
the distractors work and ultimately refine the effectiveness of the test item
itself.
Table 8.6: Selection of Distractors
Columns: Distractor A, Distractor B, Distractor C, Distractor D
Item 1: 8*
Item 2: 8*
Item 3: 8*
Item 4: 8*
Item 5: 7*
d.
* indicates key
For Item 1, the discrimination index for each distractor can be calculated
using the discrimination index formula. From Table 8.5, we know that all the
students in the upper group answered this item correctly and only one student
from the lower group did so. If we assume that the three remaining students
from the lower group all selected distractor B, then the discrimination index for
item 1, distractor B will be:
This negative value indicates that more students from the lower group
selected the distractor compared to students from the upper group. This result
is to be expected of a distractor and a value of -1 to 0 is preferred.
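The calculation for Item 1, distractor B can be sketched in Python under two assumptions that are not stated explicitly here: that the discrimination index is the difference between the upper-group and lower-group counts divided by the group size, and that each group contains four students (one lower-group student answered correctly and the remaining three chose B). The function name `distractor_discrimination` is hypothetical.

```python
def distractor_discrimination(upper_count, lower_count, group_size):
    """Discrimination index for one option: (upper - lower) / group size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (upper_count - lower_count) / group_size

# Item 1, distractor B, under the assumptions stated above:
# no upper-group student chose B, three lower-group students did,
# and each group is assumed to contain four students.
index_b = distractor_discrimination(upper_count=0, lower_count=3, group_size=4)
print(index_b)  # -0.75
```

Under these assumptions the value falls within the preferred range of -1 to 0.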
EXERCISE
1. Calculate the mean, mode, median and range of the following set of
scores:
23, 24, 25, 23, 24, 23, 23, 26, 27, 22, 28.
2. What is a normal curve and what does this show? Does the final
result always show a normal curve and how does this relate to
standardised tests?
TOPIC 9
REPORTING OF ASSESSMENT DATA
9.0 SYNOPSIS
Topic 9 focuses on reporting assessment data. It provides teachers with brief
descriptions on the purposes of reporting and the reporting methods.
9.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1. understand the purposes of reporting assessment data;
2. understand and use the different reporting methods in language assessment.
9.2 FRAMEWORK OF TOPICS
CONTENT
SESSION NINE (3 hours)
9.2.1 Purposes of reporting

We can say that the main purpose of tests is to obtain information
concerning a particular behaviour or characteristic. Based on information
obtained from tests, several different types of decisions can be made.
Kubiszyn & Borich (2000) mention eight different types of decisions
made on the basis of information obtained from tests. These educational
decisions are shown in Figure 9.1.
Figure 9.1: Eight Types of Decisions Made
Instructional decisions are made based on test results when, for
example, teachers decide to change or maintain their instructional
approach. If a teacher finds out that most of his class have failed his
test, there are many possible reactions he can have. The teacher
could evaluate the effectiveness of his own teaching or instructional
approach and implement the necessary changes. Tests yield scores
and teachers will have to make decisions in terms of the kind of
grades to give students. As grades are indicators of student
performance, teachers need to decide whether a student deserves a
high grade, perhaps an A, on the basis of some form of
assessment.
Traditionally, and perhaps for a long time to come, this assessment will be
in the form of tests. Sometimes, we give tests to find out the strengths
and weaknesses of our students.
Decisions related to selection, placement, counselling and guidance,
programme or curriculum, and administrative policy are all made at
levels higher than the classroom.
Administrators, educational agencies and institutions may be involved in
these decisions.
Selection and placement decisions are somewhat similar. However, a
selection decision relates to whether or not a student is selected for a
programme or for admission into an institution based on a test score.
Tests such as TOEFL and IELTS are often used by universities to decide
whether a candidate is suitable, and hence selected for admission.
A placement decision, however, deals with where a candidate should
be placed based on performance on the test. A clear example is the
language placement examination for newly admitted students commonly
administered by many local and foreign universities.
Based on their performance on such a test, students are placed into
different language classes that are arranged according to proficiency
levels.
Counselling and guidance decisions are also made by relevant parties
such as counsellors and administrators on the basis of exam results.
Counsellors often give advice in terms of appropriate vocations for some
of their students. This advice is likely to be given on the basis of the
students' own test scores. Programme or curriculum decisions reflect the
kinds of changes made to the educational programme or curriculum
based on examination results. Finally, there are also administrative
policy decisions that need to be made which are also greatly influenced
by test scores.
9.2.2
Reporting methods
Student achievement and progress can be assessed and reported in the following ways:
i. Norm-Referenced Assessment and Reporting
Assessing and reporting a student's achievement and progress in
comparison to other students.
ii. Criterion-Referenced Assessment and Reporting
Assessing and reporting a student's achievement and progress in
comparison to predetermined criteria.
iii. An Outcomes Approach
Acknowledges that students, regardless of their class or grade, can be
working towards syllabus outcomes anywhere along the learning
continuum. An outcomes approach to assessment will provide information about
student achievement to enable reporting against a standards
framework.
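The difference between norm-referenced and criterion-referenced reporting can be illustrated by reporting the same score in both ways. The Python sketch below is only an illustration: the function names, the class scores and the cut score of 60 are all hypothetical.

```python
def percentile_rank(score, cohort_scores):
    """Norm-referenced: percentage of the cohort scoring below `score`."""
    below = sum(1 for s in cohort_scores if s < score)
    return 100 * below / len(cohort_scores)

def meets_criterion(score, cut_score):
    """Criterion-referenced: has the predetermined standard been reached?"""
    return score >= cut_score

cohort = [45, 52, 58, 60, 63, 67, 70, 74, 81, 90]  # hypothetical class scores
student_score = 67

print(percentile_rank(student_score, cohort))        # 50.0
print(meets_criterion(student_score, cut_score=60))  # True
```

The first result changes if the cohort changes, while the second depends only on the predetermined criterion.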
Principles of effective and informative assessment and reporting

Effective and informative assessment and reporting practice:
Has clear, direct links with outcomes

The assessment strategies employed by the teacher in the
classroom need to be directly linked to and reflect the syllabus
outcomes. Syllabus outcomes in stages will describe the standard
against which student achievement is assessed and reported.
Is integral to teaching and learning

Effective and informative assessment practice involves selecting
strategies that are naturally derived from well structured teaching
and learning activities. These strategies should provide information
concerning student progress and achievement that helps inform
ongoing teaching and learning as well as the diagnosis of areas of
strength and need.
Is balanced, comprehensive and varied

Effective and informative assessment practice involves teachers
using a variety of assessment strategies that give students multiple
opportunities, in varying contexts, to demonstrate what they know,
understand and can do in relation to the syllabus outcomes.
Effective and informative reporting of student achievement takes a
number of forms including traditional reporting, student profiles,
Basic Skills Tests, parent and student interviews, annotations on
student work, comments in workbooks, portfolios, certificates and
awards.
Is valid
Assessment strategies should accurately and appropriately assess
clearly defined aspects of student achievement. If a strategy does
not accurately assess what it is designed to assess, then its use is
misleading.
Valid assessment strategies are those that reflect the actual
intention of teaching and learning activities, based on syllabus
outcomes.
Where values and attitudes are expressed in syllabus outcomes,
these too should be assessed as part of student learning.
Is fair
Effective and informative assessment strategies are designed to
ensure equal opportunity for success regardless of students' age,
gender, physical or other disability, culture, background language,
socio-economic status or geographic location.
Engages the learner
Effective and informative assessment practice is student centred.
Ideally there is a cooperative interaction between teacher and
students, and among the students themselves.
The syllabus outcomes and the assessment processes to be used
should be made explicit to students. Students should participate in
the negotiation of learning tasks and actively monitor and reflect
upon their achievements and progress.
Values teacher judgement
Good assessment practice involves teachers making judgements,
on the weight of assessment evidence, about student progress
towards the achievement of outcomes.
Teachers can be confident a student has achieved an outcome
when the student has successfully demonstrated that outcome a
number of times, and in varying contexts.
The reliability of teacher judgement is enhanced when teachers
cooperatively develop a shared understanding of what constitutes
achievement of an outcome. This is developed through cooperative
programming and discussing samples of student work and
achievements within and between schools. Teacher judgement
based on well defined standards is a valuable and rich form of
student assessment.
Is time efficient and manageable
Effective and informative assessment practice is time efficient and
supports teaching and learning by providing constructive feedback to
the teacher and student that will guide further learning.
Teachers need to plan carefully the timing, frequency and nature of their
assessment strategies. Good planning ensures that assessment and
reporting are manageable and maximises the usefulness of the strategies
selected (for example, by addressing several outcomes in one
assessment task).
Recognises individual achievement and progress
Effective and informative assessment practice acknowledges that
students are individuals who develop differently. All students must be
given appropriate opportunities to demonstrate achievement.
Effective and informative assessment and reporting practice is
sensitive to the self esteem and general well-being of students,
providing honest and constructive feedback.
Values and attitudes outcomes are an important part of learning that
should be assessed and reported. They are distinct from knowledge,
understanding and skill outcomes.
Involves a whole school approach
An effective and informative assessment and reporting policy is
developed through a planned and coordinated whole school approach.
Decisions about assessment and reporting cannot be taken
independently of issues relating to curriculum, class groupings,
timetabling, programming and resource allocation.
Actively involves parents
Schools and their communities are responsible for jointly developing
assessment and reporting practices and policies according to their local
needs and expectations.
Schools should ensure full and informed participation by parents in the
continuing development and review of the school policy on reporting
processes.
Conveys meaningful and useful information
Reporting of student achievement serves a number of purposes, for a
variety of audiences. Students, parents, teachers, other schools and
employers are potential audiences. Schools can use student
achievement information at a number of levels including individual, class,
grade or school. This information helps identify students for targeted
intervention and can inform school improvement programs. The form of
the report must clearly serve its intended purpose and audience.
Effective and informative reporting acknowledges that students can be
demonstrating progress and achievement of syllabus outcomes across
stages, not just within stages.
Good reporting practice takes into account the expectations of the
school community and system requirements, particularly the need for
information about standards that will enable parents to know how their
children are progressing.
Student achievement and progress can be reported by comparing
students' work against a standards framework of syllabus outcomes,
comparing their prior and current learning achievements, or comparing
their achievements to those of other students. Reporting can involve a
combination of these methods. It is important for schools and parents to
explore which methods of reporting will provide the most meaningful and
useful information.
TOPIC 10
ISSUES AND CONCERNS RELATED TO

ASSESSMENT IN MALAYSIAN PRIMARY
SCHOOLS
10.0 SYNOPSIS
Topic 10 focuses on the issues and concerns related to assessment in the
Malaysian primary schools. It will look at how assessment is viewed and used
in Malaysia.
10.1 LEARNING OUTCOMES
By the end of this topic, you will be able to:
1. understand some issues and concerns regarding assessment in
Malaysian primary schools;
2. understand Chapter 4 of the Malaysia Education Blueprint 2013-2025;
3. use the different types of assessment in assessing language in school
(cognitive-level, school-based and alternative assessment).
10.2 FRAMEWORK OF TOPICS
CONTENT
SESSION TEN (3 hours)
10.3
Exam-oriented System
The educational administration in Malaysia is highly centralised with four
hierarchical levels; that is, federal, state, district and the lowest level, school.
Major decision- and policy-making takes place at the federal level represented
by the Ministry of Education (MoE), which consists of the Curriculum
Development Centre, the school division, and the Malaysian Examination
Syndicate (MES).
The current education system in Malaysia is too examination-oriented and
over-emphasizes rote-learning with institutions of higher learning fast
becoming mere diploma mills. Like most Asian countries (e.g., Gang 1996; Lim
and Tan 1999; Choi 1999), Malaysia so far has focused on public examination
results as important determinants of students' progression to higher levels of
education or occupational opportunities (Chiam 1984).
The Malaysian education system requires all students to sit for public
examinations at the end of each level of schooling. There are four public
examinations from primary to postsecondary education. These are the
Primary School Achievement Test (UPSR) at the end of six years of primary
education, the Lower Secondary Examination (PMR) at the end of another
three years of schooling, the Malaysian Certificate of Education (SPM) at the
end of 11 years of schooling, and the Malaysian Higher School Certificate
Examination (STPM) or the Higher Malaysian Certificate for Religious
Education (STAM) at the end of 13 years of schooling (MoE 2004).
Malaysia Education Blueprint 2013-2025

In October 2011, the Ministry of Education launched a
comprehensive review of the education system in Malaysia in
order to develop a new National Education Blueprint. This
decision was made in the context of rising international
education standards, the Government's aspiration of better
preparing Malaysia's children for the needs of the 21st
century, and increased public and parental expectations of
education policy. Over the course of 11 months, the Ministry
drew on many sources of input, from education experts at
UNESCO, World Bank, OECD, and six local universities, to
principals, teachers, parents, and students from every state in
Malaysia. The result is a preliminary Blueprint that evaluates
the performance of Malaysia's education system against
historical starting points and international benchmarks. The
Blueprint also offers a vision of the education system and
students that Malaysia both needs and deserves, and suggests
11 strategic and operational shifts that would be required to
achieve that vision. The Ministry hopes that this effort will
inform the national discussion on how to fundamentally
transform Malaysia's education system, and will seek
feedback from across the community on this preliminary effort
before finalising the Blueprint in December 2012.
The examined Curriculum
In public debate, the issue of teaching to the test has often translated
into debates over whether the UPSR, PMR, and SPM examinations
should be abolished. Summative national examinations should not in
themselves have any negative impact on students. The challenge is that
these examinations do not currently test the full range of skills that the
education system aspires to produce. An external review by Pearson
Education Group of the English examination papers at UPSR
and SPM level noted that these assessments would benefit from
the inclusion of more questions testing higher-order thinking skills,
such as application, analysis, synthesis and evaluation. For example,
their analysis of the 2010 and 2011 English Language UPSR papers
showed that approximately 70% of the questions tested basic skills of
knowledge and comprehension.
LP (Lembaga Peperiksaan, the Examinations Syndicate) has started a series
of reforms to ensure that, as per policy, assessments are evaluating
students holistically. In 2011, in parallel with the KSSR (the Primary
School Standard Curriculum), the LP rolled out the new PBS (Pentaksiran
Berasaskan Sekolah, or school-based assessment) format that is intended
to be more holistic, robust, and aligned to the new standard-referenced
curriculum. There are four components to the new PBS:
School assessment refers to written tests that assess subject
learning. The test questions and marking schemes are developed,
administered, scored, and reported by school teachers based on
guidance from LP;
Central assessment refers to written tests, project work, or
oral tests (for languages) that assess subject learning. LP develops
the test questions and marking schemes. The tests are, however,
administered and marked by school teachers;
Psychometric assessment refers to aptitude tests and a
personality inventory to assess students' skills, interests, aptitude,
attitude and personality. Aptitude tests are used to assess students'
innate and acquired abilities, for example in thinking and problem
solving. The personality inventory is used to identify key traits and
characteristics that make up the students' personality. LP develops
these instruments and provides guidelines for use. Schools are,
however, not required to comply with these guidelines; and
Physical, sports, and co-curricular activities assessment
refers to assessments of student performance and participation
in physical and health education, sports, uniformed bodies, clubs,
and other non-school sponsored activities. Schools are given the
flexibility to determine how this component will be assessed.
The new format enables students to be assessed on a broader range of
output over a longer period of time. It also provides teachers with more
regular information to take the appropriate remedial actions for their
students. These changes are hoped to reduce the overall emphasis on
teaching to the test, so that teachers can focus more time on delivering
meaningful learning as stipulated in the curriculum.
In 2014, the PMR national examinations will be replaced with school
and centralised assessment. In 2016, a student's UPSR grade will no longer
be derived from a national examination alone, but from a combination of PBS
and the national examination. The format of the SPM remains the same, with
most subjects assessed through the national examination, and some subjects
through a combination of examinations and centralised assessments.
10.4
Cognitive Levels of Assessment
Bloom's Taxonomy of Cognitive Levels
Knowledge
Comprehension
Application
Analysis
Synthesis
Evaluation
Knowledge
Recalling memorized information. May involve remembering a wide range of
material from specific facts to complete theories, but all that is required is the
bringing to mind of the appropriate information. Represents the lowest level of
learning outcomes in the cognitive domain.
Learning objectives at this level: know common terms, know specific facts,
know methods and procedures, know basic concepts, know principles.
Question verbs: Define, list, state, identify, label, name, who? when? where?
what?
Comprehension
The ability to grasp the meaning of material. Translating material from one
form to another (words to numbers), interpreting material (explaining or
summarizing), estimating future trends (predicting consequences or effects).
Goes one step beyond the simple remembering of material, and represents the
lowest level of understanding.
Learning objectives at this level: understand facts and principles, interpret
verbal material, interpret charts and graphs, translate verbal material to
mathematical formulae, estimate the future consequences implied in data,
justify methods and procedures.
Question verbs: Explain, predict, interpret, infer, summarize, convert,
translate, give example, account for, paraphrase x.
Application
The ability to use learned material in new and concrete situations. Applying
rules, methods, concepts, principles, laws, and theories. Learning outcomes
in this area require a higher level of understanding than those under
comprehension.
Learning objectives at this level: apply concepts and principles to new
situations, apply laws and theories to practical situations, solve mathematical
problems, construct graphs and charts, demonstrate the correct usage of a
method or procedure.
Question verbs: How could x be used to y? How would you show, make use
of, modify, demonstrate, solve, or apply x to conditions y?
Analysis
The ability to break down material into its component parts. Identifying parts,
analysis of relationships between parts, recognition of the organizational
principles involved. Learning outcomes here represent a higher intellectual
level than comprehension and application because they require an
understanding of both the content and the structural form of the material.
Learning objectives at this level: recognize unstated assumptions, recognize
logical fallacies in reasoning, distinguish between facts and inferences,
evaluate the relevancy of data, analyze the organizational structure of a work
(art, music, writing).
Question verbs: Differentiate, compare / contrast, distinguish x from y, how
does x affect or relate to y? why? how? What piece of x is missing / needed?
Synthesis
(By definition, synthesis cannot be assessed with multiple-choice questions. It
appears here to complete Bloom's taxonomy.)
The ability to put parts together to form a new whole. This may involve the
production of a unique communication (theme or speech), a plan of
operations (research proposal), or a set of abstract relations (scheme for
classifying information). Learning outcomes in this area stress creative
behaviors, with major emphasis on the formulation of new patterns or
structure.
Learning objectives at this level: write a well organized paper, give a well
organized speech, write a creative short story (or poem or music), propose a
plan for an experiment, integrate learning from different areas into a plan for
solving a problem, formulate a new scheme for classifying objects (or events,
or ideas).
Question verbs: Design, construct, develop, formulate, imagine, create,
change, write a short story and label the following elements:
Evaluation
The ability to judge the value of material (statement, novel, poem, research
report) for a given purpose. The judgments are to be based on definite
criteria, which may be internal (organization) or external (relevance to the
purpose). The student may determine the criteria or be given them. Learning
outcomes in this area are highest in the cognitive hierarchy because they
contain elements of all the other categories, plus conscious value judgments
based on clearly defined criteria.
Learning objectives at this level: judge the logical consistency of written
material, judge the adequacy with which conclusions are supported by data,
judge the value of a work (art, music, writing) by the use of internal criteria,
judge the value of a work (art, music, writing) by use of external standards of
excellence.
Question verbs: Justify, appraise, evaluate, judge x according to given criteria.
Which option would be better/preferable to party y?
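The question verbs listed for each level can be used for a rough first check of how a set of test questions is spread across the cognitive levels. The Python sketch below is a heuristic illustration only, since a verb alone does not determine the level an item actually tests. The dictionary `BLOOM_VERBS`, the function `bloom_level` and the sample questions are hypothetical, and only single-word verbs from the lists above are included.

```python
BLOOM_VERBS = {
    "Knowledge": {"define", "list", "state", "identify", "label", "name"},
    "Comprehension": {"explain", "predict", "interpret", "infer", "summarize",
                      "convert", "translate", "paraphrase"},
    "Application": {"show", "modify", "demonstrate", "solve", "apply"},
    "Analysis": {"differentiate", "compare", "contrast", "distinguish"},
    "Synthesis": {"design", "construct", "develop", "formulate", "imagine",
                  "create", "change"},
    "Evaluation": {"justify", "appraise", "evaluate", "judge"},
}

def bloom_level(question):
    """Return the level whose verb list contains the question's first word."""
    words = question.split()
    first = words[0].lower().strip(",.:;?") if words else ""
    for level, verbs in BLOOM_VERBS.items():
        if first in verbs:
            return level
    return "Unclassified"

# Hypothetical question stems.
questions = [
    "List three animals mentioned in the passage.",
    "Explain why the boy was unhappy.",
    "Compare the two characters in the story.",
    "Justify the ending chosen by the writer.",
]
levels = [bloom_level(q) for q in questions]
lower_order = sum(level in ("Knowledge", "Comprehension") for level in levels)

print(levels)
print(lower_order / len(levels))  # 0.5
```

A teacher could use the final proportion to see how much of a paper rests on knowledge and comprehension questions before reviewing the items by hand.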
10.5
School-based Assessment
The traditional system of assessment no longer satisfies the educational
and social needs of the third millennium. In the past few decades, many
countries have made profound reforms in their assessment systems.
Several educational systems have in turn introduced school-based
assessment as part of or instead of external assessment in their
certification. While examination bodies acknowledge the immense
potential of school-based assessment in terms of validity and flexibility,
at the same time they have to guard against or deal with difficulties
related to reliability, quality control and quality assurance. In the debate
on school-based assessment, the issue of "why" has been widely written
about and there is general agreement on the principles of validity of
this form of assessment.
Izard (2001) as well as Raivoce and Pongi (2001) explain that
school-based assessment (SBA) is often perceived as the process put in place
to collect evidence of what students have achieved, especially in
important learning outcomes that do not easily lend themselves to
pen and paper tests. Daugherty (1994) clarifies that this type of
assessment has been recommended: because of the gains in the
validity which can be expected when students' performance on
assessed tasks can be judged in a greater range of contexts and more
frequently than is possible within the constraints of time-limited, written
examinations. However, as Raivoce and Pongi (2001) suggest, the
validity of SBA depends to a large extent on the various assessment
tasks students are required to perform.
Burton (1992) provides the following five rules of thumb that may be
applied in the planning stage of school-based assessment:
1. The assessment should be appropriate to what is being assessed.
2. The assessment should enable the learner to demonstrate positive
achievement and reflect the learner's strengths.
3. The criteria for successful performance should be clear to all
concerned.
4. The assessment should be appropriate to all persons being assessed.
5. The style of assessment should blend with the learning pattern so it
contributes to it.
In the Malaysian context, SBA has the following features:
Assessment for and of learning
Standard-referenced assessment
Holistic
Integrated
Balanced
Robust
Components of SBA/ PBS
1. Academic:
School Assessment (using Performance Standards)
Centralised Assessment
2. Non-academic:
Physical Activities, Sports and Co-curricular Assessment (Pentaksiran
Aktiviti Jasmani, Sukan dan Kokurikulum - PAJSK)
Psychometric/Psychological Tests
Centralised Assessment
Conducted and administered by teachers in schools using instruments,
rubrics, guidelines, time line and procedures prepared by LP
Monitoring and moderation conducted by PBS Committee at School,
District and State Education Department, and LP
School Assessment
The emphasis is on collecting first-hand information about pupils' learning
based on curriculum standards.
Teachers plan the assessment, prepare the instrument and administer the
assessment during the teaching and learning process.
Teachers mark pupils' responses and report their progress continuously.
10.6
Alternative Assessment
Alternative assessments are assessment procedures that differ from
the traditional notions and practice of tests with respect to format,
performance, or implementation. It is likely that alternative assessment
found its roots in writing assessment because of the need to provide
continuous assessment rather than a single impromptu evaluation
(Alderson & Banerjee, 2001).
As the term indicates, alternative assessments are assessment
proposals that present alternatives to the more traditional
examination formats. They have become more popular of late because
of some doubts raised regarding the ability of traditional assessment to
elicit a fair and accurate measure of a student's performance.
Alternative assessment brings with it a complete set of
perspectives that contrast with traditional tests and assessments.
Table 10.1 illustrates some of the major differences between traditional
and alternative assessments.
Table 10.1: Contrasting Traditional and Alternative Assessment
Source: Adapted from Bailey (1998: 207) and Puhl (1997: 5)

Traditional Assessment              Alternative Assessment
One-shot tests                      Continuous, longitudinal assessment
Indirect tests                      Direct tests
Inauthentic tests                   Authentic assessment
Individual projects                 Group projects
No feedback to learners             Feedback provided to learners
Speeded exams                       Power exams
Decontextualised test tasks         Contextualised test tasks
Norm-referenced score reporting     Criterion-referenced score reporting
Standardised tests                  Classroom-based tests
Summative                           Formative
Product of instruction              Process of instruction
Intrusive                           Integrated
Judgmental                          Developmental
Teacher proof                       Teacher mediated

In discussing alternative assessments, Herman et al. (1992: 6) list several of
their common characteristics. They describe alternative assessments as
performing the following:
Ask the students to perform, create, produce, or do something.
Tap higher-level thinking and problem-solving skills.
Use tasks that represent meaningful instructional activities.
Invoke real-world applications.
People, not machines, do the scoring, using human judgment.
Require new instructional and assessment roles for teachers.

Alternative assessments are suggested largely due to a growing concern that
traditional assessments are not able to accurately measure the ability we are
interested in. They are also seen to be more student centred as they cater
for different learning styles, cultural and educational backgrounds as well as
language proficiencies.
Tannenbaum (1996) comments that alternative assessments focus on
documenting individual strengths and development which would assist in
the teaching and learning process.
Nevertheless, although alternative assessments are compatible with the
contemporary emphases on the process as well as product of learning
(Croker, 1999), several shortcomings of alternative assessments have been
noted.
Perhaps one of the major limitations of alternative assessments is that
accounts of the benefits of alternative assessment tend to be descriptive and
persuasive, rather than research-based (Alderson & Banerjee, 2001: 229).
Alternative assessments are also said to be limited to the classroom and have
not become part of mainstream assessment. Brown and Hudson, in
advocating alternative assessment, seem to have taken a safer approach by
suggesting the term "alternatives in assessment". They believe that educators
should be familiar with all possible formats of assessment and decide on the
format that best measures the ability or construct that they are interested in.
Hence, these alternatives would include all possible assessment formats both
traditional and informal.
Despite these limitations, alternative assessments present a viable and
exciting option in eliciting and assessing the students' actual abilities. There
are a number of test formats that are considered alternative assessment
formats.
Physical demonstration
Pictorial products
Reading response logs
K-W-L (what I know/what I want to know/what I've learned) charts
Dialogue journals
Checklists
Teacher-pupils conferences
Interviews
Performance tasks
Portfolios
Self assessment
Peer assessment
Portfolios
A well-known and commonly used alternative assessment is the portfolio
assessment. The contents of the portfolio become evidence of abilities,
much as we would use a test to measure the abilities of our
students.
Bailey (1998, p. 218) describes a portfolio as containing four primary
elements.
First, it should have an introduction to the portfolio itself
which provides an overview of the content of the portfolio. Bailey
even suggests that this section include a reflective essay by the
student in order to help express the student's thoughts and
feelings about the portfolio, perhaps explaining strengths and
possible weaknesses as well as why certain pieces are
included in the portfolio.
Secondly, she argues that portfolios should have what she
refers to as an "academic works" section. This section is meant to
demonstrate the student's improvement or achievement in the
major skill areas (p. 218).
The third section is described as a personal section in
which students may wish to include their journals, score reports
of tests that they have sat for, as well as photographs and other
items that illustrate their experiences with, as well as
achievements in, the English language.
Finally, an assessment section may contain evaluations
made by peers and teachers as well as self-evaluations.
Table 10.1: Contents of a Portfolio
Source: Adapted from Bailey (1998: 218)
Introductory Section: Overview; Reflective Essay
Academic Works Section: Samples of best work; Samples of work demonstrating development
Personal Section: Journals; Score reports; Photographs; Personal items
Assessment Section: Evaluation by peers; Self-evaluation
The portfolio can be said to be a student's personal documentation that
helps demonstrate his or her ability and successes in the language. It
may even require students to consciously select items that can document
their own progress as learners. The actual compilation of the content of
the portfolio is in itself a learning experience. Some suggest that students
should attach a short reflection on each piece or item placed in the
portfolio. Portfolio assessment, therefore, is both a learning and
assessment experience. This dual function can be considered as one of
the benefits of portfolio assessment.
Brown and Hudson (1998) summarise several other advantages of using
portfolios in assessment. They discuss these advantages according to
how the portfolio strengthens students' learning, enhances the teacher's
role and improves the testing process. With respect to testing, portfolios
used as an assessment instrument are said to (pp. 664-665):
enhance student and teacher involvement in assessment;
provide opportunities for teachers to observe students using meaningful language to accomplish various authentic tasks in a variety of contexts and situations;
permit the assessment of the multiple dimensions of language learning;
provide opportunities for both students and teachers to work together and reflect on what it means to assess students' language growth;
increase the variety of information collected on students; and
make teachers' ways of assessing student work more systematic.
Self Assessment and Peer Assessment
Two other common forms of alternative assessment are the
self-assessment and peer-assessment procedures. Both these forms
of assessment are strongly advocated by Puhl (1997) as she
believes that they are essential to continuous assessment, a
cornerstone of alternative assessment. The benefits of self and
peer assessment are especially found in formative stages of
assessment in which the development of the students' abilities
is emphasised.
Self appraisals are also thought to be quite accurate and are said
to increase student motivation. Puhl (1997) describes a case
study in which she believes self assessment forced the students
to reread and thereby make necessary edits and corrections to
their essays before they handed them in. Nevertheless, in order
for self assessment to be useful and not a futile exercise, the
learners need to be trained and initially guided in performing their
self assessment. This training involves providing students with
the rationale for self assessment and explaining how it is intended
to work and how it can help them.
In language teaching and learning, self assessment is relevant in
assessing all the language skills. An example of the self
assessment of the listening skill, especially in the comprehension
of questions asked, is suggested by Cohen (1994) as follows:
Comprehension of questions asked:
5. I can always understand the questions with no difficulties and without having to ask for repetition
4. I can usually understand questions, but I might occasionally ask for repetition
3. I have difficulty with some questions, but I generally get the meaning
2. I have difficulty understanding most questions even after repetition
1. I don't understand questions well at all

These questions are useful in the formative stages of
assessment as they help students identify their own strengths and
weaknesses and respond accordingly. Through asking these
types of self assessment questions, the students are expected
to become more sensitive to their own learning and ultimately
perform better in the final summative evaluation at the end of
the instructional programme.
Peer assessment differs from self assessment in that it involves
the social and emotional dimensions to a much greater extent.
Peer assessment can be defined as a response in some form to
other learners' work (Puhl, 1997). It can be given by a group or
an individual and it can take any of a variety of coding systems:
the spoken word, the written word, checklists, questionnaires,
nonverbal symbols, numbers along a scale, colours, etc. (p. 8).
Peer assessment requires that a student take up the role of a
critical friend to another student in order to support, challenge,
and extend each other's learning (Brooks, 2002: 73). Peer
assessment is reported to:
remind learners they are not working in isolation;
help create a community of learners;
improve the product ("two heads are better than one");
improve the process and motivate, even inspire, learners;
help learners be reflective; and
stimulate meta-cognition.
EXERCISE
In your opinion, what are the advantages of using portfolios as
a form of alternative assessment?
REFERENCES
Allen, I. J. (2011). Repriviledging reading: The negotiation of
uncertainty.
Pedagogy: Critical Approaches to Teaching
Literature, Language Composition, and Culture, 12 (1) pp. 97-120.
Available at:
http://pedagogy.dukejournals.org/cgi/doi/10.1215/153142001416540(RetrievedSeptember 26, 2013)
Alderson, J. C. (1986b). Innovations in language testing? In M.
Portal
(Ed.), Innovations in language testing. pp. 93-105.
Windsor: NFER/Nelson.
Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test
construction
and evaluation. Cambridge: Cambridge University
Press.
Anderson, L.W. (Ed.), Krathwohl, D.R. (Ed.), Airasian,P.W.,
Cruikshank, K.A.,
Mayer, R.E., Pintrich, P.R.,Raths, J., &
Wittrock, M.C. (2001). A
taxonomy for learning, teaching, and
assessing: A revision of Bloom's
Taxonomy of Educational
Objectives (Complete edition). New York: Longman.
Anderson, K. M., (2007). Differentiating instruction to include all
students. Preventing School Failure, 51 (3) pp. 49-54.
Bachman, L. F. (2004). Statistical Analyses for Language
Assessment. pp.
22-23. Cambridge, UK: Cambridge
University Press.
Biggs, J. B. and Collis, K. F. (1982).Evaluating the Quality of
Learning: the
SOLO taxonomy. New York, NY: Academic Press.
Biggs, J. B., & Collis, K .F. (1991) Multimodal learning and the quality
of intelligent behaviour. In: H. Rowe (Ed.) Intelligence:
Reconceptualization and measurement. Hillsdale, NJ: Lawrence
Erlbaum. pp. 57-75.
113
Biggs, J.B.& Tang, C. (2009). Applying constructive alignment to

outcomes- based teaching and learning. Training Material. Quality
Teaching for
Learning in Higher Education Workshop for Master
Trainers. Ministry
of Higher Education. Kuala Lumpur.
Black, P. & Wiliam, D. (2009). Developing the theory of formative
assessment
J. Gardiner, ed. Educational Assessment
Evaluation and Accountability, 1 (1), pp. 531.
Available at: http://eprints.ioe.ac.uk/1119/. (Retrieved 23 August
2013)
Bloom, B. S. (Ed.). Engelhart, M.D., Furst, E.J., Hill,W.H., &
Krathwohl, D.R. (1956). Taxonomy of educational objectives: The
classification of educational goals. Handbook 1: Cognitive
domain.New York: David
McKay.
Bloom, B. S. (1956). Taxonomy of Educational Objectives, Handbook
I: The
Cognitive Domain. New York: David McKay Co Inc.
Brennan, R. L. (1996). Generalizability of performance assessments.
In G.
W. Phillips (Ed.), Technical issues in large-scale
performance
assessment (NCES 96-802) (pp. 19-58).
Washington, DC: National
Center for Education Statistics.
Brown, H. D., & Abeywickrama, P. (2010). Language Assessment:
Principles and Classroom Practices.New York, NY: Pearson
Education.
Brown, G., & Yule, G. (1983). Teaching the spoken language.
Cambridge: Cambridge
University Press.
Brown, H.D. (1994). Teaching by principles: An interactive approach

to language pedagogy. Englewood Cliffs, NJ: Prentice Hall Regents.
Campbell, K. J., Watson, J. M., & Collis, K. F. (1992).Volume
measurement and intellectual development. Journal of Structural
Learning. 11, pp.
279-298.
Carroll, J. B., & Sapon, S. M. (1958). Modern Language Aptitude
Test. New
York, NY: The Psychological Corporation.
Cheng, L. Watanabe, Y., & Curtis, A. (Eds.). (2004). Washback in
language
testing: Research contexts and methods. Mahwah,
NJ: Lawrence Erlbaum Associates.
Chick, H. (1998).Cognition in the Formal Modes: Research
mathematics and the SOLO taxonomy. Mathematics Education
Research Journal. 10 (2)
pp. 4-26.
114
Clark, J. (1979). Direct vs. semi-direct tests of speaking ability. In E.

Briere & F. Hinofotis (Eds.), Concepts in language testing: Some
recent studies (pp. 35-49). Washington, DC:TESOL.
Davidson, F., Hudson, T. & Lynch, B. (1985). Language testing:
Operationalization in classroom measurement and L2 research.
In M.
Celce-Murcia (Ed.). Beyond basics: Issues and research
in TESOL pp. 137-152. Rowley, MA: Newbury House.
Davidson, F., & Lynch, B. (2002). Testcraft: A teachers guide to
writing and using language test specifications. New Haven, CT:
Yale University Press.
Davies, A., Brown, A., Elder, C., Hill, K., Lumley, T. and
McNamara, T. (1999). Dictionary of language testing. Cambridge:
University ofCambridge Local Examinations Syndicate and
Cambridge University Press.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (ed.).
Educational Measurement. (3rd. ed.) pp.105-146. New York, NY:
Macmillan.
Gottlieb, M. (2006). Assessing English Language Learners:
Bridges from Language Proficiency to Academic Achievement.
USA: Corwin Press.
Grotjahn, R. (1986).Test validation and cognitive psychology:
Some methodological considerations.Language Testing
3,pp.15885.
Hattie, J. (2009).Visible Learning. New York: Routledge.
Hattie, J. (2012) Visible Learning for Teachers: Maximizing Impact
on
Learning. Abingdon: Routledge
Hattie, J. & Brown, G. (2004) Cognitive processes in asTTle: The
SOLO taxonomy. University of Auckland/Ministry of Education.
asTTle Technical Report 43
Hook, P. & Mills, J. (2011) SOLO Taxonomy: A Guide for Schools
Book 1: A
common language of learning. Laughton, UK:
Essential Resources Educational Publishers.
Huang, S.C. (2012).English Teaching: Practice and Critique 11 (4),
pp.
99119.
Hughes, A. (2003). Testing for language teachers (2nd. Ed.).
Cambridge,
MA: Cambridge University Press.
115
Gavin, B. et al. (2008). An introduction to educational assessment,

measurement and evaluation. (2nd ed.). Australia: Pearson
Education New Zealand.
McNamara, T. (2000). Language testing. Oxford, UK: Oxford
University Press.
Linn, R. L., & Gronlund, N. E. (2000). Measurement and
assessment in teaching. (8th ed.). Upper Saddle River, NJ:
Merrill/Prentice Hall.
Malaysia Education Blueprint 2013-2025.
McMillan, J. H. (2001a.). Classroom assessment: Principles and
practice for
effective instruction.(2nd ed.). Boston: MA: Allyn &
Bacon.
Messick, S. (1989). Validity. In R. Linn (Ed.) Educational
measurement. Pp.
13-103. New York, NY:: MacMillan.
Moseley, D., Baumfield, V., Elliott, J., Gregson, M., Higgins, S.,
Miller, J., &
Newton, D. (2005).Frameworks for Thinking: A
handbook for teaching
and learning. Cambridge: Cambridge
University Press.
Mousavi, S. A. (2009). An encyclopedic dictionary of language
testing (4th ed.)
Tehran: Rahnama Publications.
Norleha Ibrahim. (2009). Management of measurement and
evaluation
Module. Selongor: Open University Malaysia.
Nckles, M., Hbner, S. & Renkl, A. (2009). Enhancing selfregulated learning
by writing learning protocols. Learning and
Instruction, 19(3), pp. 259 271. Available
at: http://linkinghub.elsevier.com/retrieve/pii/S0959475208000558
(Retrieved March 26, 2013).
Oller, J. W. (1979). Language tests at school: A pragmatic
approach. London: Longman.
Pearson, I. (1988).Tests as levers for change. In D. Chamberlain
& R. Baumgardner (Eds.), ESP in the classroom: Practice and
evaluation (Vol. 128, 98-107). London: Modern
116
EnglishPublications.
Pimsleur, P. (1966). Pimsleur Language Aptitude Battery. New
York, NY:
Harcourt, Brace & World.
Shepard, L. A. (2000). The role of assessment in a learning
culture. Paper
presented at the Annual Meeting of the
American Educational
Research Association.
Available
http://www.aera.net/meeting/am2000/wrap/praddr01.htm
(Retrieved 10.8.2013)
Smith, A. (2011) High Performers: The Secrets of Successful
Schools.
Camarthen: Crown House Publishing.
Smith, T.W. & Colby, S.A. (2007). Teaching for Deep Learning. The
Clearing
House. 80 (5) pp. 205211.
Spaan, M. (2006). Test and item specifications
development.Language Assessment Quarterly, 3, pp. 71-79.
Spratt, M. (2005). Washback and the classroom: The implications
for teaching and learning of studies of washback from exams.
Language Teaching Research, 19, 5-29.
Stansfield, C., & Reed, D. (2004). The story behind the Modern
Language
Aptitude Test: An interview with John B. Carrol
(1916-2003). Language
Assessment Quarterly, 1, pp.43-56.
Websites
http://www.catforms.com/pages/Introduction-to-Test-Items.html
(Retrieved 9.8.2013)
http://myenglishpages.com/blog/summative-formativeassessment/ - (Retrieved 10.8.2013)
http://www.teachingenglish.org.uk/knowledge-database/objectivetest - (Retrieved 12.8.2013)
http://assessment.tki.org.nz/Using-evidence-for
learning/Concepts/Concept/Reliability-and-validity
117
MODULE WRITING PANEL
PROGRAM PENSISWAZAHAN GURU (Graduate Teacher Programme)
DISTANCE LEARNING MODE
(PRIMARY EDUCATION)
NAME: NURLIZA BT OTHMAN
othmannurliza@yahoo.com
ACADEMIC QUALIFICATIONS:
M.A. TESL, University of North Texas, USA
B.A. (Hons) English, North Texas State University, USA
Sijil Latihan Perguruan Guru Siswazah (Graduate Teacher Training Certificate), Kementerian Pelajaran Malaysia
WORK EXPERIENCE:
4 years as a secondary school teacher
21 years as a lecturer at IPG
NAME: ANG CHWEE PIN
chweepin819@yahoo.com
ACADEMIC QUALIFICATIONS:
M.Ed. TESL, Universiti Teknologi Malaysia
B.Ed. (Hons.) Agri. Science/TESL, Universiti Pertanian Malaysia
WORK EXPERIENCE:
23 years as a secondary school teacher
7 years as a lecturer at IPG