
Language Testing 2009 26(3) 423–443

Assessing paired orals: Raters' orientation to interaction
Ana Maria Ducasse La Trobe University, Australia
Annie Brown Ministry of Higher Education and Scientific
Research, United Arab Emirates

Speaking tasks involving peer-to-peer candidate interaction are increasingly being incorporated into language proficiency assessments, in both large-scale
international testing contexts, and in smaller-scale, for example course-related,
ones. This growth in the popularity and use of paired and group orals has stim-
ulated research, particularly into the types of discourse produced and the possi-
ble impact of candidate background factors on performance. However, despite
the fact that the strongest argument for the validity of peer-to-peer assessment
lies in the claim that such tasks allow for the assessment of a broader range of
interactional skills than the more traditional interview-format tests do, there
is surprisingly little research into the judgments that are made of such per-
formances. The fact that raters, and rating criteria, are in a crucial mediating
position between output and outcomes, warrants investigation into how raters
construe the interaction in these tasks. Such investigations have the potential
to inform the development of interaction-based rating scales and ensure that
validity claims are moved beyond the content level to the construct level.
This paper reports the findings of a verbal protocol study of teacher-raters
viewing the paired test discourse of 17 beginner dyads in a university-based
Spanish as a foreign language course. The findings indicate that the raters
identified three interaction parameters: non-verbal interpersonal communica-
tion, interactive listening, and interactional management. The findings have
implications for our understanding of the construct of effective interaction
in paired candidate speaking tests, and for the development of appropriate
rating scales.

Keywords: paired interaction construct, rating oral proficiency

Address for correspondence: Ana Maria Ducasse, La Trobe University, Spanish Program, School of Historical and European Studies, Victoria 3086, Australia; email: A.Ducasse@latrobe.edu.au

© The Author(s), 2009. DOI: 10.1177/0265532209104669

I Introduction
Oral interviews have been the preferred method for the assessment of
foreign or second oral language ability since the inception of modern oral proficiency testing, through the Cambridge English Exams in
the first part of the twentieth century, and subsequently through
the highly influential FSI/ACTFL interviews (see Brown, 2005 and
Fulcher, 2003, for a history of oral interview testing). However, as a
result of progressive changes in the view of what speaking in tests
should consist of, since the late 1980s, pair or group tasks involving
peer-to-peer interaction have increasingly been used.
The move away from interview to peer-to-peer testing is linked
to claims of positive washback on the classroom (Hilsdon, 1991;
Messick, 1996). With the increase in use of pair and group work in
language classrooms, such tasks were also said to be more represent-
ative of best practice in classroom activities (Morrison & Lee, 1985;
Taylor, 2001; Egyud & Glover, 2001). From a pragmatic perspec-
tive, peer-to-peer assessment is typically also more time and cost
efficient as candidates are tested together, and raters assess two or
more candidates simultaneously. However, the main motivation for
this shift to the testing of paired or grouped candidates was the reali-
zation that interview tests resulted in test discourse or institutional
talk, and did not represent normal conversation (van Lier, 1989)
or provide candidates with the opportunity to show their ability to
participate in interaction other than as an interviewee, responding
to questions.
The shift from the view that speaking in a second language
(L2) generally meant information transfer to an acknowledgement
that speaking involved negotiating meaning derived from sec-
ond language acquisition (SLA) research (Gass & Varonis, 1985;
Long, 1983). The term interactional competence was coined by
Kramsch (1986) and later taken up by He and Young (1998) within
Interactional Competence Theory. Interactional competence can
be defined in terms of how speakers structure and sequence their
speech, and how they apply turn-taking rules. It can also be defined
in terms of how speakers collaborate and support their interactional
partner to co-construct the spoken performance (Jacoby & Ochs,
1995). This interactional view of communication had a huge influ-
ence on language teaching, initially in the 1980s, but subsequently
also on testing. In testing, the recognition of the importance of inter-
action and negotiation of meaning raised the question of whether
the dialogue between interviewer and candidate in an interview
had any parallels with non-test casual conversation and, from the
mid 1990s, testers were challenged to acknowledge that spoken
language displayed in a test required a social view of performance,
taking account of the bearing interlocutors had on each other during
co-constructed interaction (Jacoby & Ochs, 1995; He & Young,
1998; McNamara, 2001).
These theoretical challenges to the validity and supremacy of
the oral interview led to an increase in empirical studies of the
nature of oral interview discourse during the 1990s. It was found
that interview discourse was characterized by a power differential,
and that turn taking, topic organization, sequence and the overall
structure were predetermined or controlled by the interviewer.
Turn-taking was characterized by questions (the interviewer) and
answers (the candidate), with little opportunity for the candidate to
introduce topics or control the direction of the interaction (Perrett,
1990; Lazaraton, 1992; Young & Milanovic, 1992; Johnson, 2001;
Csépes, 2002).
More recently, studies of peer-to-peer interaction have provided
insights into the type of discourse produced by candidates when the
interviewer/rater plays a minimal role. Peer-to-peer interaction has
been found to be more balanced (Egyud & Glover, 2001) and inter-
active (ffrench, 1999), with candidates producing a greater range
of functions (Kormos, 1999; Lazaraton, 2002) and interactional
patterns being more varied (Saville & Hargreaves, 1999). Taylor
(2001) reported a greater percentage of interactional and conver-
sational management functions in the paired tasks. Galaczi (2004)
examined the interactional behaviours of high- and low-scoring stu-
dents on a paired speaking task, and found that successful pairs (who
had higher scores on the criterion 'Interactive Communication')
achieved these higher scores by making topic-building moves. In
short, peer-to-peer tasks have been found to provide the poten-
tial for a wider range of functional and interactional moves than
is generally possible in the more traditional interviewer-led oral
interview.

1 Assessment of peer-to-peer interaction


While, as this range of studies attests, considerable attention has
been paid to the discourse produced in peer-to-peer test perform-
ances, there has not been the same attention paid to the other,
equally important, side of the construct: the criteria by which
the performance is judged, and the process of rating, which medi-
ates between the discourse and the score. While it is important to
understand the impact that peer-to-peer tests have on the language
performance of the learners, it is important also to investigate what language experts (teachers and assessors) value while rating
pairs, because it is their view of interaction which finds its reflec-
tion in the test scores, and it is these scores that are used to draw
inferences about language learner abilities. As such it can be argued
that the validity of the test depends as much on the criteria and the
rating procedure as on the nature of the task performance, because
in any assessment involving judgment 'it is the criteria by which the performance is judged which define the construct' (Brown, 2005, p. 26).
Nevertheless, research into the rating of peer-to-peer assess-
ments is both more recent and more scarce. Most studies which
have focused on the rating of paired or group orals have tended
to focus on the relationship between scores and learner character-
istics (see, for example, Berry, 1997; Iwashita, 1996; O'Sullivan,
2002; Norton, 2005). One noteworthy example of a study focus-
ing on the raters' orientations is that of May (2006a, b). May
used retrospective verbal reports to examine the criteria used
by raters of paired candidate discussion tasks in an EAP setting.
She found that raters incorporated body language, assertiveness
through communication, and the ability to manage the discus-
sion and work together cooperatively into their assessments of
'Effectiveness' (the most interactional of the criteria). Orr
(2002), in an analysis of verbal reports produced by raters of a
paired oral test, found evidence to suggest that the construct might
involve non-verbal communication. Other studies, while focusing
on discourse rather than raters, have provided evidence of notice-
able differences between performances at different levels. Galaczi
(2004), for example, found that discourse on paired tasks at the
lower proficiency level was not connected because the speakers
failed to work with each other, whereas higher proficiency learn-
ers develop[ed] the ability to work with their interlocutor [and]
shift more successfully between the role of listener and speaker
(Galaczi, 2004, p. 264), and on the basis of this finding proposed
the existence of a lower and a higher level conversational manage-
ment ability criterion.

2 Rating scale development and the assessment of peer-to-peer performance
While language test developers have been swift to embrace the
concept of paired or group orals involving peer-to-peer assessment,
taking into account that peer-to-peer tasks 'open up the possibility of enriching our construct definition' (Fulcher, 2003, p. 189), it
follows that this complex construct needs to be identified and
described. However, modifications to scales to incorporate a focus
on interactional skills have tended to be both intuitive and vague,
although as Lazaraton (2002) points out, this is starting to change.
She notes that, over time, the two interactional components of the
scoring rubric of the FCE Oral test, interactive communication and
discourse management, have been developed drawing on a mix of
approaches.
Rating scale development methods can be divided in two
broad types: intuitive and evidence-based methods (Fulcher, 2003).
Although the intuitive method is by far the most common way of
arriving at rating scales, drawing predominantly on expert judgment
and experience (North & Schneider, 1998), it has been criticized for
producing scales which 'lack firm empirical substantiation in respect to evidence about L2' (Cumming, Kantor, & Powers, 2002, p. 68),
and for leading to vagueness and generality in the descriptors used
to define bands (Fulcher, 2003, p. 96). Assessment criteria derived
from empirical data, on the other hand, and hence based on what is
both observable and quantifiable, could not only address problems
of reliability but also contribute to the validity argument, at least in
cases where the developer of the scale is responsible for the report-
ing mechanism and the associated explanations and interpretations
of the scores, by enhancing the link between the score and the infer-
ences that are drawn from it.
There are two main approaches to the development and valida-
tion of data-based scales: analysis of learner speech or writing
samples (e.g. Fulcher, 1996; Brown, Iwashita, & McNamara,
2005) and analysis of rater orientations (e.g. Turner & Upshur,
1996; Pollitt & Murray, 1996; Norris, 2001; Brown, Iwashita &
McNamara, 2005). However, while there have been a growing
number of such studies in relation to the assessment of oral lan-
guage over the last decade, most have so far involved interviews,
not peer-to-peer interaction, and those that have examined peer-to-
peer interaction have mainly focused on the discourse, not on the
identification of features attended to by raters when judging the
effectiveness of performance. An analysis of verbal reports pro-
duced by raters when assessing peer-to-peer interaction would
complement these discourse studies and provide valuable insights
into the aspects of interaction salient to judgments of interactional
competence.

II Aim of the study


This study takes up the question of what raters focus on when rating
paired candidate interaction. It seeks to examine whether the differ-
ences between interview and peer-to-peer candidate talk that have
been evinced in the arguments supporting the use of peer-to-peer test
tasks, are actually salient to assessors, describable, and assessable.
The focus of the paper is, therefore, the construct of interaction.

III Context of the study


The context of the study is an Australian university Spanish beginner-
level course, taught across two universities. In both universities a
paired oral test is administered at the end of the second semester.
Students take it after one hundred and four hours of formal language
instruction over the two semesters. The paired (peer-to-peer) inter-
action constitutes the whole of the speaking test in which the rater
plays a minimal role as a facilitator.
The data consisted of 17 videotaped operational tests, collected
during the end of semester testing from 17 pairs of candidates
(Figure 1). The candidates, first-year undergraduate beginners,
volunteered for the project in exchange for personalized individual
feedback on their performance. As was the case for all the students
completing the course assessment, they were acquainted with one
another and chose partners from their cohort with whom to take the
paired oral.
The 12 raters all had current or previous teaching experience
in the beginners program, where they had taught entirely in
the target language and had been trained to assess the students.
Eleven of the raters were native speakers of Spanish and one
was a non-native speaker. The native speakers were from Spain
(eight) and Latin America (four). Three raters were male, and
nine were female. The raters had a range of qualifications from
undergraduate degrees to postgraduate teacher training and doc-
toral degrees. Three were language teaching assistants funded by
the Spanish government.

Test data: 17 videos of operational orals. Paired candidates speak for 10 minutes in total on one task; each candidate pair is observed by at least 2 different raters.

Verbal report data: 12 raters individually observe and record comments on 3 assigned paired candidates.

Verbal report analysis: Transcription of verbal reports from the 12 raters' cassette tapes; content analysis of the transcription to develop a coding grid.

Figure 1 Data collection and analysis

IV The elicitation task


The paired task in this study is a discussion task. Candidates are
required to engage in a discussion with their peers, covering a set of
given topics within a specified time. The raters do not speak at all to
the pair; they participate only in the role of rater and are explicitly
instructed to refrain from guiding the pair at any time.
Each pair is given a card with three topics, as on the sample task
cards in Figure 2 (in the exam there are no English translations
on the cards). The topics are drawn from the course content and are
set out in Spanish. After reading the card, candidates begin speak-
ing to each other on any one of the topics, with the further task of
introducing, maintaining and changing between the three topics. The
time allowed for the task is 10 minutes.

Tiene 10 minutos en total (You have 10 minutes in total)

La familia (The family)
Los días festivos (Public holidays)
Los amigos (Friends)

Tiene 10 minutos en total (You have 10 minutes in total)

Los fines de semana (The weekend)
Los pasatiempos (Hobbies)
Las vacaciones de verano (Summer holidays)

Figure 2 Sample task cards

V Methodology
A method that is increasingly being used in language testing studies
to gain insight into the rating activity is the use of verbal protocols to examine rater cognition during the rating process (Ericsson &
Simon, 1993; Green, 1998). Because of the on-line nature of oral
assessment, retrospective reports and stimulated recalls, rather than
concurrent reports, are the norm.
This study uses verbal protocols of this kind, in the form of a retrospective report followed by a stimulated recall (described below). To this point verbal protocols have not been used to investigate L2 Spanish rating of beginner peer-to-peer pairs. This methodology allows the researcher to uncover key features that reflect the views of the participant raters while they carry out simulated ratings.
The content analysis of the discourse from the verbal reports in
this study allows an in-depth examination of the nature of the raters'
orientation towards interactional features. This is possible by ask-
ing the raters to make explicit their perceptions of what interaction
consists of as a construct, by focusing on what they attend to when
focusing on interaction in a paired task. It is important to note that
the raters are not guided as to what features of interaction they
should consider when watching.
The purpose of the content analysis is to gain insight into how
raters operationalize the construct of peer-to-peer interaction as a pre-
cursor to developing an evidence-based rating scale. Analysis of the
content of the rater protocols provides insights across a range of rat-
ers in relation to a range of candidates.

VI Collection and transcription of verbal reports


Each of the 12 raters was given a CD with video clips of three different pairs of candidates, an audiotape for recording their comments, and a set of instructions (the raters were dispersed internationally and around Australia, making it impossible to provide face-to-face training). Each pair of candidates was
commented on by at least two different raters (Rater 1: candidates 1, 2, 3; Rater 2: candidates 2, 3, 4; etc.). In order to prevent any
effect that position in the order of the viewing might have on the
quantity or quality of the raters' comments, the performances were
sequenced in such a way as to ensure that no pair was repeatedly
presented first or last in the sets of three.
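
The assignment and counterbalancing logic can be sketched in code. The following is a minimal Python illustration of one scheme satisfying the two constraints stated above (each rater reports on three pairs; each pair is commented on by at least two different raters), together with a rotation of viewing order; the study's actual roster is not fully reported, so the constants and helper names here are invented for the example.

    # A sketch (not the study's actual roster) of an assignment in which
    # 12 raters each view 3 of the 17 pairs, every pair is viewed by at
    # least 2 different raters (12 x 3 = 36 viewings over 17 pairs), and
    # the viewing order is rotated so that no pair is systematically
    # presented first or last.
    from collections import Counter

    NUM_RATERS = 12
    NUM_PAIRS = 17
    PAIRS_PER_RATER = 3

    def assigned_pairs(rater: int) -> list[int]:
        """Deal three consecutive pair IDs to each rater, wrapping past 17."""
        start = (rater - 1) * PAIRS_PER_RATER
        return [(start + k) % NUM_PAIRS + 1 for k in range(PAIRS_PER_RATER)]

    def viewing_order(rater: int) -> list[int]:
        """Rotate each rater's set so the first/last position varies."""
        pairs = assigned_pairs(rater)
        shift = (rater - 1) % PAIRS_PER_RATER
        return pairs[shift:] + pairs[:shift]

    coverage = Counter()
    for r in range(1, NUM_RATERS + 1):
        order = viewing_order(r)
        coverage.update(order)
        print(f"Rater {r:2d} views pairs in order {order}")

    # Every pair is commented on by at least two raters.
    assert all(coverage[p] >= 2 for p in range(1, NUM_PAIRS + 1))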
The raters were asked to record a verbal report in two steps after
watching each of the performances assigned to them. For each performance, the verbal reporting consisted of two distinct but linked
steps. First, the raters watched the entire video clip performance of a
single pair without stopping and recorded a summary (a retrospec-
tive report) of their impression of the paired interaction. The raters
were asked to focus specifically on the interaction, including, but not
restricted to, what was said and how it was said, and were asked to
provide reasons they thought could account for the success or other-
wise of the interaction. In the second step, they were asked to watch
the same performance for a second time, this time stopping the tape
at intervals to record comments (a stimulated recall). Raters paused
when they noticed something significant and recorded a comment
about the pair's interaction. It should be noted at this point that the
verbal reports were produced individually, with no prior discus-
sion as to what 'interaction' meant. This was an important factor in
obtaining unguided observations.
These two steps were repeated three times until the allotted three
ten-minute clips had been first summarized and then commented on
in detail.
The 36 verbal reports (12 raters on three paired candidates each)
were transcribed orthographically. For each report, the tape was
replayed and the transcription was checked against the original.
It was evident from the transcripts that some pairs of candidates
inspired more commentary than others. In addition, not all raters ver-
balized to the same extent about what was happening in the interac-
tion they were observing.

VII Analysis of the verbal report data


For the purposes of this study, one third of the data was used, that is, one report from each rater (the full set of data was used to develop the rating scale; see Ducasse, 2009). Each report was divided into segments.
The first segment was the summary (first impression) recorded by the rater after watching once. Then each new segment was marked by
the stopping and starting of the tape at intervals for comment, while
the raters watched for a second time.
Next the segments were divided into units for analysis by the
main researcher. These units are defined by Green (1998, p. 19) as 'a single or several utterances with a single event or task as the focus expressed as one idea'. Repetitions and elaborations were
not recorded as new units. One segment from Rater 2 observing pair
4 was divided into ideas units like this (/ marks a break between
ideas units):
Jane didn't understand so Mary repeats and rephrases about three times in the end she gives up and then she says 'my birthday was ... when was yours' and finally Jane got the gist of what she was being asked./ Jane keeps sighing / and Mary is good in her responses she uses more 'aha yes si entiendo' both candidates don't do that often there is not that bridge in the conversation the long pauses make the interaction quite hard to follow/
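
Once the researcher has marked the breaks, the '/' notation in the example lends itself to mechanical processing; the following minimal Python sketch (with the segment text abbreviated from the example above) simply splits on the delimiter:

    # Split a verbal-report segment into ideas units, using the '/'
    # delimiter convention from the transcription example above.
    def idea_units(segment: str) -> list[str]:
        """Return the non-empty ideas units, trimmed of whitespace."""
        return [unit.strip() for unit in segment.split("/") if unit.strip()]

    segment = ("Jane didn't understand so Mary repeats and rephrases ... / "
               "Jane keeps sighing / "
               "and Mary is good in her responses ...")
    for number, unit in enumerate(idea_units(segment), start=1):
        print(f"Unit {number}: {unit}")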
The next step was to code the ideas units. First the main researcher
scanned the ideas units for particular orientations, focusing on
aspects of successful and unsuccessful interaction. In this way a set
of coding categories emerged from the data.
Defining categories involves repeated data reduction, rearrang-
ing and recoding as the researcher cycles through the data. Green
(1998) recommends finding a balance between too many and too
broad categories to achieve rater reliability. Multiple categories
were created at the discovery stage; these were subsequently
grouped by theme after more coding. This procedure was repeated
until five categories that dominated the data emerged. They were
as follows:
1) body language
2) listening
3) turn taking
4) topic cohesion
5) anything other than interaction.
Next, the researcher and another coder independently coded the
full data set using these categories. Using a formula proposed by
Hatch and Lazaraton (1991), the total number of agreements on
a) division into ideas units and b) coding of ideas units was divided
by the total number of ratings. This resulted in inter-coder agreement of 84%, in line with Storch's (2001) observation that in discourse studies the level of agreement is often in the vicinity of 80% of the total data coded.
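
The calculation itself is straightforward; the Python sketch below illustrates the coding-step computation with invented codings (the coded data themselves are not reproduced here, and the published 84% also counted agreement on the division into ideas units, which this sketch omits):

    # Percent inter-coder agreement in the manner of Hatch and
    # Lazaraton (1991): agreements divided by total coding decisions.
    CATEGORIES = {"body_language", "listening", "turn_taking",
                  "topic_cohesion", "other"}

    def percent_agreement(coder_a: list[str], coder_b: list[str]) -> float:
        """Proportion of ideas units coded identically by both coders."""
        assert len(coder_a) == len(coder_b), "coders must code the same units"
        assert set(coder_a) | set(coder_b) <= CATEGORIES
        agreements = sum(a == b for a, b in zip(coder_a, coder_b))
        return agreements / len(coder_a)

    # Hypothetical codings of ten ideas units; the single disagreement
    # is a question coded as turn taking by one coder and as topic
    # cohesion by the other -- the very confusion discussed below.
    coder_a = ["listening", "turn_taking", "body_language", "other",
               "topic_cohesion", "listening", "turn_taking", "listening",
               "body_language", "other"]
    coder_b = ["listening", "topic_cohesion", "body_language", "other",
               "topic_cohesion", "listening", "turn_taking", "listening",
               "body_language", "other"]

    print(f"Agreement: {percent_agreement(coder_a, coder_b):.0%}")  # -> 90%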
The disagreements in the double-coded data were discussed and
resolved to arrive at the final set of categories, and an agreed set of
ideas units and codings. The main cause of disagreement was the
coding of questions under both turn taking and topic cohesion. This
problem is acknowledged by Galaczi (2004, p. 97), who noted that
the multi-functionality of questions as both topic management and
conversation management devices caused some discrepancy in the
coding. Ultimately, in the present study the two categories were joined to become interactional management, with turn taking and topic cohesion as subcategories.
As the level of inter-coder agreement was considered adequate,
the remaining two thirds of the data were recoded only where the
categories had been revised.

VIII Results
The final coding was carried out with three main categories of inter-
actional features. Each of these is defined and discussed below with
examples taken from the rater verbal-protocol transcriptions.
Each of the three categories is listed with its subcategories
before being defined. The first category is non-verbal interpersonal
communication. It consists of two subcategories: gaze and body lan-
guage. The second category is interactive listening with two subcate-
gories: supportive listening and comprehension. The third category
is interactional management with two subcategories: horizontal
and vertical management. The final category encompassed all the
comments that were irrelevant to interaction. These are not elabo-
rated further here, but included comments on linguistic resources or
vocabulary, appraisals of the performance, or the candidates' level
of acquaintanceship. They also included comments on the exam or
the task.

1 Non-verbal interpersonal communication


The first category included comments on non-verbal interpersonal
communication between the candidates. It included two subcatego-
ries: gaze, and all other body language including gesture. The raters
would be capable of commenting to some degree on this category
even if the sound was turned off and the clip was watched in silence.
It is evidenced through the presence or lack of a physical non-
verbal fluency between the pair. It includes the flow of eye move-
ment, and of gesture and body positioning, that physically support
what takes place verbally in the interaction.
To take an example from the gaze subcategory, raters commented
on whether the candidates looked at each other or not during the per-
formance. Rater 1 commented on the presence of positive gaze:
What works really well with them is that they look at each other and never
lose the thread of the conversation. (Rater 1, Pair 1)
Candidates that did not look at each other also attracted comments
on the lack of gaze. This was seen as a negative attribute. Rater 6
comments:
They look at the paper not at each other. (Rater 6, Pair 13)
Body language, the other subcategory for non-verbal communica-
tion, could also be viewed as a positive or negative factor in success-
ful communication. In relation to hand gestures, for example, there
are both positive and negative comments. In this comment the use of
hands is positively appraised:
I find the use of the hands ... I find the girl with the glasses uses her hands
when she talks. It gives a nice color and is more in tune with the Latin
American speech and culture; and it's her way of expressing her feelings.
It helps her interaction to be more positive and fluent. (Rater 2, Pair 3)
However, hand gestures elicit the opposite reaction when they are
used to support difficulties in conveying meaning. Rater 5 comments
here that:
There are too many gestures and that gives me the impression that they lack
verbal resources. (Rater 5, Pair 12)

2 Interactive listening
The second category was drawn from the comments made by raters
on the candidates' manner of displaying attention or engagement
while listening during the interaction. Listening as part of successful
interaction was divided into two subcategories: comprehension, and
a different type of listening termed supportive listening.
The first subcategory, comprehension, is a means of showing
engagement, of giving encouragement for the speaker to continue
or demonstrating comprehension via verbal support. Comments fell
into this category when raters noticed candidates filling a silence, asking for clarification, or demonstrating comprehension.
In the case of filling a silence the raters noticed that a candidate
provided the word the other partner was searching for, as in the
example: 'She sometimes filled in with a missing word to help' (Rater 1, Pair 3). This showed that the partner had been attending and had comprehended sufficiently to predict a missing word, thus enabling
the interaction to continue. Conversely, if one of the speakers did not
fill a silence this was also remarked on by the raters. The following
two examples are from two raters viewing the same performance.
I think she could have helped out finding the words when the other girl was
thinking for a long while. You have to be patient but she could have helped
with a word or a question. (Rater 10, Pair 17)
The girl is struggling with the next question. There is no attempt by the
partner to help her out or to put words into her mouth. She just sat there and
waited for her to get out of the predicament. (Rater 8, Pair 17)
The raters interpret this to indicate that the person who is listening
is not engaged or not supporting the speaker by either signaling for
the speaker to go on or by taking the floor and offering to break the
silence. To the raters, such behaviour represents un-interactive or
unsuccessful listening. Comprehension, as a subcategory of interac-
tive listening, also included demonstrating comprehension by com-
menting on the partners contributions, or failing to do so. Rater 2
commented on Pair 5 as follows:
What he says relates to what has been said before. They make comments
about what each other is saying and it makes the conversation look as if
they are really interested. (Rater 2, Pair 5)
It can also include offering or requesting clarification, which allows
the dialogue to continue:
He doesn't understand so he is given an example to help him. (Rater 8,
Pair 10)
Another way of facilitating is to clarify what the other person has said.
(Rater 10, Pair 2)
Apart from filling silences, demonstrating comprehension, and clari-
fying, another type of support offered through listening attentively
is back-channeling. This is not unlike body language, except that
back-channels such as 'sí, sí' or 'ajá' (in Spanish) are audible as well
as visual. They may be accompanied by non-verbal communication
or physical back-channel. This kind of support provides feedback
while the other speaker maintains the floor; they are sounds that are
interpreted as evidence that a candidate is listening interactively:
The girl with the glasses used a lot of back channel, a lot of confirmation
'mm, sí, ah sí'. She was very responsive and her body language showed a
lot of confidence. She used a lot of physical prompts to continue the interac-
tion with more inflection and intonation in her voice. (Rater 2, Pair 3)
Back-channeling falls into a category termed supportive listening,
rather than comprehension, because, despite encouraging the other
speaker to continue, such sounds do not necessarily convey any
signs of comprehension. Of the two types of interactive listening,
the first required evidence of comprehension by the listener; the
second simply required the provision of audible support with sounds.
However, both types of interactive listening are deemed to support
the interaction and were attended to by the raters.

3 Interactional management
The third feature of successful interaction identified by the raters
emerged from comments on the management of the topics and turns.
This can be theorized from different perspectives. Between adjacent
turns it could be viewed as horizontal management that makes
the conversation flow. However, across topics it could be termed
vertical management that exhibits a flexibility that allows switch-
ing between topics. The raters' comments on turn taking show how
interaction is managed horizontally. Elements connected to speaker
change such as speed of response, turn length or domination come
under this category, for example:
They are incapable of comprehending and replying quickly. He takes a long
while to answer. (Rater 10, Pair 2)
It is important to listen and to allow time to respond, to be sensitive to taking
turns and not dominating. (Rater 10, Pair 17)
These comments refer to two aspects of turn taking. First, there is
the need to reply within a reasonable time. Second, it is necessary to
leave time for the other to respond.
A second type of interactional management connected topics
vertically down the complete oral text. Raters commented on candi-
dates' ability to connect topics:
She is following the conversation and as a consequence it is all interrelated.
(Rater 1, Pair 2)
They commented on candidates' ability to develop the conversation
by extending the topic:
He finds something related to what has gone before. (Rater 1, Pair 2)
And they commented on whether candidates facilitated interaction:
Topics are connected and he is following the conversation he tries to make
sense with the person on the right. (Rater 6, Pair 15)
In this conversation they are asking the right kinds of questions and it
gives coherence to what they are talking about; quite good for beginners.
(Rater 9, Pair 2)
Both turn change and topic cohesion were viewed by raters as
indicators of successful interaction. As long as candidates made
sense across turns, by asking relevant questions for example, they
were connecting horizontally. In the following example, where the speaker introduced a new topic and it was taken up, the vertical flow was being maintained at the same time:
If one does not know what to say the other helps by changing the topic, and
another thing that makes it natural is that for example they both interrupt
each other with more questions. (Rater 5, Pair 12)

IX Discussion
In this section the results from the final reduction of the data to
the three categories are discussed: first interpersonal non-verbal
communication, then interactive listening and finally interactional
management skills.

1 Interpersonal non-verbal communication: An interlocutor in a pair demonstrates communication through non-verbal communication using gaze and body language
The raters found body language, or non-verbal behaviour, to be a
contributing feature to the success or lack thereof of interpersonal
interaction, as also found by May (this volume). This does not
appear in the literature on paired test performance, nor is it yet widely
studied in second language acquisition, as is discussed by Lazaraton
(2004). Its salience to the raters here, however, indicates that it is an
important aspect of interactional ability, and that different types of
non-verbal language provide evidence of ability or lack of ability.
There is also some evidence that the raters view non-verbal language
to be, at least to some extent, culture-specific.

2 Interactive listening: An interlocutor in a pair actively demonstrates communication by indicating comprehension to the speaker or through supportive audible feedback
With regard to interactive listening, the findings supported those
of Galaczi (2004) in that the raters here also found that successful
interaction involved candidates moving between the role of listener
and speaker. This contrasted with unsuccessful interaction, where
evidence of interactive listening was absent. Thus, listening is
identified as having a crucial role to play in successful interaction.
Participants have two roles to play: that of listener and speaker. If
both candidates perform both roles successfully, then fluency across
the pair prevails. These features could be an elaboration of fluency
between two people, which is the essence of co-constructed dialogue
(Jacoby & Ochs, 1995) compared to individual fluency in extended
speech as in a monologue.
An issue to consider in relation to interactive listening, however, is
whether such supportive behaviour can mask a lack of comprehension.
Comments produced by candidates watching their own performances
later (Ducasse, 2007) would seem to indicate this was indeed the case,
at least some of the time: 'Most of the time I could understand what she was saying but at that point I just couldn't understand so I just kept saying sí sí' (Pair 12, Candidate 25).
An implication of this is that raters might potentially jump either way with such behaviour, interpreting it positively (as providing interactional support) or negatively (as indicating a lack of comprehension).
Back-channels could also be seen (positively) as an aspect of stra-
tegic competence, whereby a listener encourages the speaker to
continue until they reach a point where they understand sufficiently
to re-join the conversation. This situation has an analogy with the
findings of Brown (2003) and Brown, Iwashita and McNamara
(2005) that self-repair was sometimes judged negatively, as
interfering with comprehensibility, and sometimes positively,
as evidence of test takers' ability to monitor and correct their own
performance.

3 Interactional management skills: An interlocutor in a pair demonstrates communication in the current turn or over different topics
The last two elements constituting interactional management, turn-taking (or horizontal cohesion) and topic (or vertical cohesion), are
not completely new in the context of paired assessment. Tests that
have paired tasks, such as the Cambridge Suite, have already rec-
ognized the need for criteria that are worded to incorporate con-
versation management. The finding that vertical and horizontal
cohesion are important is perhaps less surprising than body language
or listening, because they have been reported in previous findings on
conversational management, although not from the perspective of
raters' orientations. The raters in this study, however, by explicitly
focusing on these features in their assessments have confirmed
ffrench's (1999) finding that there are more interactional and conversational management functions in a paired task than in an interview, a salient difference in terms of the interaction construct. The manner in which turns are organized is recognized by the
raters as a feature contributing to successful interaction.
On being asked to comment on the interaction, the raters were
making explicit how they viewed, interpreted and thus operational-
ized the interaction construct. Just as it would be difficult to incorporate all aspects of all features at all levels in writing, for example, into a useable scale, similarly we need to identify from the
mass of oral interactional behaviours those that are attended to by
raters when making judgements in order to develop a useful and use-
able scale. The findings call on language testers to consider seriously a construct for peer-to-peer interaction that includes non-verbal communication, listening, and turn and topic management as its key features.
The question of whether interaction should be assessed at all is beyond the scope of this paper.
some contexts to argue that interaction, as defined by the findings
of this study, is not taught within second language classes, and
therefore not relevant to assessments of second language profi-
ciency. Yet given the centrality of pair and group activities within
the communicative classroom, it may be simply that the teachers
have not stopped to consider long enough what they understand
by interaction. We have set down the manner in which interac-
tion is construed in a particular context on a particular task, so that
those teaching the course now know what they mean by successful
interaction. By developing an operational definition, as was the
case in this context, both assessment and teaching can perhaps be
enhanced, with the end result that more communicatively compe-
tent learners will emerge.

X Conclusion
Egyud and Glover (2001) posed the question: if paired interaction is, in fact, different from interview interaction, how should it be accounted for in rating scales? This study has provided an initial
response to this question. It has done so by drawing on verbal reports
produced by teacher/raters in order to identify the basic components
of a rating scale for interactional ability, thus providing an opera-
tional definition of the construct, and heeding Pollitt and Murray's caution that 'as a set of scale descriptors should closely match what raters perceive in the performances they have to grade, the starting point for scale development should surely therefore be a study of the perceptions of proficiency by raters in the act of judging proficiency' (1996, p. 76).
It has shown, moreover, that the essential difference posited for
peer-to-peer interaction as compared with interview interaction,
namely the equal flow of the conversation as the two interactants
move equally between speaker and listener roles, and participate
equally in the management of the interaction, is not only visible in
the discourse but is observable and assessable by raters.
As paired and group tasks continue to increase in popularity, the need for studies such as the one reported here will increase, if test developers, be they testing specialists in test development organizations or groups of teachers working to a common curriculum, are to ensure that the criteria used to judge
performance are valid. The development of interaction scales
that have a basis in observable features which are salient to
raters can only strengthen the claims that can be made about
the validity of the scores as reflections of interactional ability.
It remains to be seen whether the features identified here are
transferable to other test tasks and contexts; this is an empirical
question.
In terms of methodology, this study has shown that the verbal
report methodology, which has been successfully used in the past
to develop and validate linguistic criteria (e.g. Cumming, Kantor, & Powers, 2002; Brown, 2000; Brown, Iwashita, & McNamara, 2005), can also be successfully applied in the development of an interac-
tion scale for speaking. Despite not having a defined model of
interaction in advance of the study, the teacher/raters were able to
verbalize and to a large extent agree on the salient features through
examination of student performance.
This, however, is not the end of the story. While the features of
the construct have been defined, the scale remains to be developed.
The future direction of this study will be to develop descriptors
which define performance at different levels of interactional pro-
ficiency. Once that is done, we will be able to measure effective
interaction both qualitatively and quantitatively.

XI References
Berry, V. (1997). Ethical considerations when assessing oral proficiency in pairs.
In A. K. Huhta, V. Kurki-Suonio, & S. Luoma (Eds.), Current develop-
ments in language testing. Jyväskylä: Jyväskylä University Press.
Brown, A. (2000). An investigation of the rating process in the IELTS speaking
module. In R. Tulloch (Ed.), IELTS research reports, Vol. 3. Sydney: ELICOS.
Brown, A. (2003). Interviewer variation and the co-construction of speaking
proficiency. Language Testing, 20(1), 1–25.
Brown, A. (2005). Interviewer variability in oral proficiency interviews.
Frankfurt am Main: Peter Lang.
Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater ori-
entations and test-taker performance on English-for-academic-purposes
speaking tasks (TOEFL Monograph No. TOEFL-MS-29). Princeton, NJ: Educational Testing Service.
Csépes, I. (2002). Measuring oral proficiency through paired performance. Unpublished PhD dissertation, Eötvös Loránd University, Budapest.
Cumming, A., Kantor, R. & Powers, D.E. (2002). Decision making while rat-
ing ESL/EFL writing tasks: A descriptive framework. Modern Language
Journal, 86, 67–96.
Ducasse, A. (2007). How do candidates view interaction in a paired oral?
In C. Gitsaki (Ed.), Language and Languages: Global and Local Tensions
(pp. 184–200). Newcastle, UK: Cambridge Scholars Publishing.
Ducasse, A. (2009). An empirically based rating scale for interaction in a paired test. In A. Brown & K. Hill (Eds.), Tasks and criteria in performance assessment: Proceedings of the 28th Annual Language Testing Research Colloquium (pp. 1–22). Frankfurt: Peter Lang.
Egyud, G., & Glover, P. (2001). Oral testing in pairs: A secondary school per-
spective. ELT Journal, 55(1), 70–76.
Ericsson, K., & Simon, H. (1993). Protocol analysis: Verbal reports as data
(revised edition). Cambridge, MA: MIT Press.
ffrench, A. (1999). Study of qualitative differences between CPE individuals
and paired test formats. Internal UCLES EFL Report.
Folland, D., & Robertson, D. (1976). Towards objectivity in group oral testing.
ELT Journal, 30, 156–167.
Fulcher, G. (1996). Does thick description lead to smarter tests? Language
Testing, 13(2), 23–51.
Fulcher, G. (2003). Testing second language speaking. London: Pearson
Education.
Galaczi, E. (2004). Peer-peer interaction in a paired speaking test: The case of
the First Certificate in English. Unpublished PhD dissertation, Teachers
College, Columbia University, New York.
Gardner, R. (2001). When listeners talk: Response tokens and listener stance.
Amsterdam: John Benjamins.
Gass, S., & Varonis, E. (1985). Task variation and non-native/non-native
negotiation of meaning. In S. Gass & C. Madden (Eds.), Input in second
language acquisition (pp. 149–161). Rowley, MA: Newbury House.
Green, A. (1998). Verbal protocol analysis in language testing research:
A handbook (Vol. 5). Cambridge: Cambridge University Press.
Hamp-Lyons, L., & Lynch, B. (1998). Perspectives on validity: An historical
analysis of language testing abstracts. In A. J. Kunnan (Ed.), Validation
in language assessment. Mahwah, NJ: Lawrence Erlbaum.
Hatch, E. M. & Lazaraton, A. (1991). The research manual: Design and statis-
tics for applied linguistics. New York: Newbury House.
He, A. W. & Young, R. (1998). Language proficiency interviews: A discourse
approach. In R. Young & A. W. He (Eds.), Talking and testing: Discourse
approaches to the assessment of oral proficiency (pp. 1–24). Amsterdam:
John Benjamins.
Hilsdon, J. (1991). The group oral exam: Advantages and limitations. In
J. C. Alderson & B. North (Eds.), Language testing in the 90s: The communi-
cative legacy. London: Modern English Publications and the British Council.
Iwashita, N. (1996). The validity of the paired interview in oral performance
assessment. Melbourne Papers in Language Testing, 5(2), 51–65.
Jacoby, S., & Ochs, E. (1995). Co-construction: An introduction. Research on
Language and Social Interaction, 28, 171–183.
Johnson, M. (2001). The art of non-conversation: A re-examination of
the validity of the oral proficiency interview. New Haven, CT: Yale
University Press.
Kormos, J. (1999). Simulating conversations in oral proficiency assessment:
A conversation analysis of role plays and non-scripted interviews in lan-
guage exams. Language Testing, 16(2), 163–188.
Kramsch, C. (1986). From language proficiency to interactional competence.
The Modern Language Journal, 70, 366–372.
Lazaraton, A. (1992). The structural organisation of a language interview:
A conversation analytic perspective. System, 20, 373–386.
Lazaraton, A. (1997). Preference organization in oral proficiency interviews.
Research on Language and Social Interaction, 30, 53–72.
Lazaraton, A. (2002). A qualitative approach to the validation of oral language
tests. Cambridge: UCLES/Cambridge University Press.
Lazaraton, A. (2004). Gesture and speech in the vocabulary explanations of one
ESL teacher: A microanalytic inquiry. Language Learning, 54(1), 79–117.
Long, M. (1983). Native speaker/non-native speaker conversation and the negotiation of comprehensible input. Applied Linguistics, 5, 177–193.
McNamara, T. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18(4), 333–349.
May, L. (2006a). An examination of rater orientations on a paired candidate discussion task through stimulated recall. Melbourne Papers in Language Testing, 11(1), 29–51.
May, L. (2006b). Effective interaction in a paired candidate EAP speaking
test. Paper presented at the 28th Annual Language Testing Research
Colloquium in Melbourne, Australia, July 2006.
Meiron, B. E. (1998). Rating oral proficiency tests: A triangulated study of
rater thought processes. Unpublished master's thesis, University of California, Los Angeles.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13, 241–256.
Morrison, D. M., & Lee, N. (1985). Simulating an academic tutorial: A test
validation study. In Y. P. Lee (Ed.), New directions in language testing
(pp. 85–92). Oxford: Pergamon Institute of English.
Norris, J. (2001). Identifying rating criteria for task-based EAP assessment.
In T. Hudson & J. D. Brown (Eds), A focus on language test develop-
ment: Expanding the language proficiency construct across a variety
of tests. Technical report 2 (pp. 163–204). Honolulu: University of Hawaii,
Second Language Teaching and Curriculum Center.
North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency
scales. Language Testing, 15(2), 217–263.
Norton, J. (2005). The paired format in the Cambridge Speaking Tests. ELT
Journal, 59(4), 287–297.
O'Sullivan, B. (2002). Using observation checklists to validate speaking test pair tasks. Language Testing, 19(1), 33–56.
Perrett, G. (1990). The language testing interview: A reappraisal. In J. de Jong & D. K. Stevenson (Eds.), Individualising the assessment of language abilities (pp. 225–228). Philadelphia, PA: Multilingual Matters.
Pollitt, A., & Murray, N. L. (1996). What do raters really pay attention to? In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Research Colloquium (pp. 74–91). Cambridge: Cambridge University Press.
Saville, N., & Hargreaves, P. (1999). Assessing speaking in the revised FCE.
ELT Journal, 53, 42–51.
Storch, N. (2001). Role relationships in dyadic interactions and their effect
on language uptake. Unpublished PhD thesis, The University of Melbourne, Melbourne.
Swain, M. (2001). Examining dialogue: Another approach to content speci-
fication and to validating inferences drawn from test scores. Language
Testing, 18, 275–300.
Taylor, L. (2001). The paired speaking test format: Recent studies. UCLES
Research Notes, Vol. 6, retrieved from http://www.cambridgeesol.org/
rs_notes/rs_nts6.pdf
Turner, C., & Upshur, J. (1996). Developing rating scales for the assessment
of second language performance. In G. Wigglesworth & C. Elder (Eds.),
The language testing cycle: From inception to washback. Australian
Review of Applied Linguistics, Series S, 13, 55–79.
van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in
coils: Oral proficiency interviews as conversation. TESOL Quarterly, 23,
489–508.
Young, R., & Milanovic, M. (1992). Discourse variation in oral proficiency
interviews. Studies in Second Language Acquisition, 14, 403–424.
