
Spaan Fellow Working Papers in Second or Foreign Language Assessment

Copyright 2010
Volume 8: 1–30
English Language Institute
University of Michigan
www.lsa.umich.edu/eli/research/spaan
Investigating the Construct Validity of a Speaking Performance Test

Hyun Jung Kim
Teachers College, Columbia University


ABSTRACT With the increased demand for the integration of a performance
component in second language (L2) testing, speaking performance assessments
have focused on eliciting examinees' underlying language ability through their
actual oral performance on a given task. Considering the nature of performance
assessments, many factors other than examinees' speaking ability are
necessarily involved in the process of evaluation. Compared to the construct
definition of speaking ability, however, relatively less attention has been given
to tasks, which are regarded as a vehicle for assessment, although there is a
growing interest in authentic tasks in eliciting real-world language samples for
evaluation. Thus, the present study investigates whether a speaking placement
test provides empirical evidence that the effect of task, as well as examinees'
attributes, should be considered in describing speaking ability in a performance
assessment. An understanding of the underlying structure of the speaking
placement test not only helps to identify the factors involved in the evaluation
process and their relationships, but ultimately makes it possible to
appropriately infer examinees' speaking ability.


In L2 testing, the notion of performance first emerged in the 1960s in response to
practical needs, and since then, the demand to integrate examinees' actual performance in L2
assessment has increased (McNamara, 1996). Early testers who advocated the integration of a
performance component focused on whether examinees could successfully fulfill a task in a
simulated real-life language use context (e.g., Clark, 1975; Jones, 1985; Morrow, 1979;
Savignon, 1972). McNamara (1996) classified this approach as a strong sense of performance
assessment in which the definition of the L2 ability construct is limited to examinees' task
completion.
In contrast, new theories of communicative competence and communicative
language ability in the 1980s and 1990s (e.g., Bachman, 1990; Bachman & Palmer, 1996;
Canale, 1983; Canale & Swain, 1980) changed not only the perception of L2 language ability,
but also the role of performance in language testing. They supported a weak sense of
performance assessment (McNamara, 1996), in which the main interest was examinees'
language ability, instead of task completion. That is, L2 ability was determined based on
various language components derived from the theoretical models of communicative
competence and communicative language ability. Examinees' actual performance was elicited
for evaluation of language ability; however, the role of performance was limited to a vehicle
to elicit examinees' underlying language ability. This approach to performance assessment,
called a construct-centered approach (Bachman, 2002), has been widely accepted by L2
testers for most general purpose language performance assessments (e.g., Brindley, 1994;
Fulcher, 2003; Luoma, 2004; McNamara, 1996; Messick, 1994; Skehan, 1998).
While the construct-centered approach to performance assessment gives priority to
definitions of L2 ability, a different perspective has recently been proposed. A task-centered
approach focuses on what examinees can do with the language; that is, whether they can
fulfill a given task (Brown, Hudson, Norris, & Bonk, 2002; Norris, Brown, Hudson, &
Yoshioka, 1998). Although this approach provides more systematic criteria for the evaluation
of examinees' task fulfillment than the approach of early testers who first argued for the
integration of performance in language testing, it basically shares the early testers' view about
what performance assessments aim to measure (i.e., the strong version of performance
assessment). According to the task-centered approach, test contexts or tasks play a crucial role
in measuring L2 ability because examinees' performance is evaluated based on real-world
criteria.
The two approaches to performance assessment appear to be contradictory in nature.
Chapelle (1998), however, argued from an interactionalist perspective that both construct
definitions and tasks should be considered together in defining L2 ability because the two
interact during communication. As reviewed, the different perspectives on L2 performance
assessment define language ability in distinct ways, each with a different focus. What is
important is not which approach is superior, but whether a test is validated before inferring
examinees' language ability from the test results. In other words, before an inference
regarding an examinee's language ability is made from test scores, test developers and users
need to establish what the test aims to measure (e.g., various language components,
performance on tasks) and whether the test actually measures what it intends to measure.
Even when a test is designed for its intended purpose (e.g., following construct definitions, task
characteristics, or both), there are still many factors that need to be considered in L2
performance assessments to understand examinees' performance and define their language
ability. Examinees' performance may be affected by factors other than their language ability
(McNamara, 1996, 1997). McNamara (1995) elaborated a schematic representation (Figure 1),
which Kenyon (1992) first presented, to conceptualize the performance dimension of L2
speaking performance tests. As presented in the figure, examinees' performance in L2
speaking tests is affected by many factors in the testing phase (i.e., candidates, tasks,
interlocutors, and their interactions) as well as in the rating phase (i.e., raters and rating
scales). Empirical studies have identified these factors that affect speaking performance test
scores as effects of the: (1) candidate (Lumley & O'Sullivan, 2005; O'Loughlin, 2002); (2)
task (Chalhoub-Deville, 1995; Clark, 1988; Elder, Iwashita, & McNamara, 2002; Farris, 1995;
Malabonga, Kenyon, & Carpenter, 2005; Shohamy, 1994; Wigglesworth, 1997); (3)
interlocutor (Brown, 2003; O'Sullivan, 2002); (4) rater (Barnwell, 1989; Bonk & Ockey,
2003; Brown, 1995; Eckes, 2005; Elder, 1993; Y. Kim, 2009; Lumley, 1998; Lumley &
McNamara, 1995; Lynch & McNamara, 1998; Meiron & Schick, 2000; Orr, 2002;
Wigglesworth, 1993); and (5) scale/criteria (M. Kim, 2001). It might be impossible to
completely eliminate the effects of these factors on examinees' speaking performance.
However, it is important to understand the relative contributions of these factors to examinees'
performance and test scores in order to better estimate examinees' speaking ability and more
appropriately interpret and use the test results.
[Figure 1 diagram; node labels: Rater, Scale/Criteria, Score, Performance, Interlocutor (including other candidate), Task, Candidate]

Figure 1. Interactions in Performance Assessment of Speaking Skills
(McNamara, 1995, p. 173)


To sum up, examinees' speaking ability can be inferred only after a test is validated
with respect to its constructs and other factors involved in the process of evaluation. The
focus of previous studies, however, has often been limited to effects of individual factors on
examinees' test performance. In other words, speaking performance tests have not been
examined within a broader framework in which various factors (e.g., examinees' language ability,
tasks, and rating criteria) interact with one another. Moreover, performance tests, especially
those which do not involve high stakes, are oftentimes used without such validation. To
address this gap, the current study seeks to explore the nature of a speaking placement test, which has
been locally used in a community English program. In order to determine whether the
speaking test accurately measures speaking ability as intended, the underlying structure of the
test is investigated in the present study. In other words, the question of whether the
hypothesized components of speaking ability (reflected in the scoring rubric) actually function
as the operationalized constructs of the test is examined. In addition, to better explain how the
test works, the effects of other variables, such as rater perceptions and task characteristics, are
also investigated. That is, factors that can have an effect on speaking performance are
considered in addition to issues regarding construct definition.

Research Questions

The current study addresses the following three research questions: (1) What is the
factorial structure of the speaking test? (2) To what extent does the speaking test measure the
intended hypothesized constructs of speaking ability? (3) In addition to the measured
variables, to what extent do other factors (i.e., raters and tasks) contribute to examinees'
speaking performance?

Method

Context of the Current Study
The Community English Program (CEP) is an English as a second language (ESL)
program offered by the Teaching English to Speakers of Other Languages (TESOL) and
applied linguistics programs at Teachers College. The program targets adult ESL learners who
wish to improve their communicative language ability. Therefore, the CEP curriculum
emphasizes not only the various language components (grammar, vocabulary, and
pronunciation) but also the different language skills (listening, speaking, reading, and writing).
To facilitate effective teaching and learning, all new students of the program are placed into
one of 12 proficiency levels based on results of a placement test, which consists of five
sections (i.e., listening, grammar, reading, writing, and speaking).
A majority of the CEP teachers are MA students of the TESOL and applied linguistics
programs. That is, they are student teachers practicing ESL classroom teaching. Therefore,
their classrooms are regularly observed by faculty and colleagues and follow-up feedback
sessions are provided throughout the semester. The teachers also serve as raters of the writing
and speaking placement tests. From the rating experience, they not only become familiar with
the CEP students' writing and speaking ability levels, but they also have an opportunity to
gain hands-on experience in evaluating ESL learners' writing and speaking ability. Therefore,
the CEP functions as a teacher education program as well as an adult ESL program.

Participants
Participants in the current study consisted of 215 incoming CEP students who took the
CEP speaking placement test. The majority of students in the program were adult immigrants
from the surrounding neighborhood or were family members of international students in the
Columbia University community. The number of female students (73%) far exceeded that of
male students (27%). In terms of the participants' first language, three languages accounted
for a large percentage: Japanese (36%), Korean (19%), and Spanish (15%). With regard to their
length of residence, the vast majority of the participants responded that they had been in
English-speaking countries, including the United States, for fewer than three years: less than
6 months (40%), 6 months to 1 year (19%), and 1 to 3 years (20%). In terms of their
motivation for studying English, many participants reported academic and job-related reasons,
while over 50 percent gave priority to communication with friends as their reason for
improving their English.

Instruments
The instruments used in the current study included the CEP placement speaking test
and an analytic scoring rubric. The speaking test was designed to measure speaking ability
under various real-life language use situations. The test had six tasks: complaining about a
catering service (Task 1), talking about a favorite movie (Task 2), narrating a story based on a
sequence of pictures (Task 3), refusing a request from a landlord (Task 4), summarizing a
radio commentary (Task 5), and summarizing a lecture (Task 6). The first three tasks (i.e.,
Tasks 1, 2, and 3) were the independent-skills tasks, which required examinees to draw on
their background knowledge to perform the tasks. On the other hand, the last three tasks (i.e.,
Tasks 4, 5, and 6) were the integrated-skills tasks, which required examinees to use their
listening skills in the performance of the tasks. That is, examinees were asked to listen to long
or short passages, which were provided as part of the tasks, and then formulate responses
based on the content of the passages.
The speaking test was a semi-direct, computer-delivered test. That is, there was no
interaction between an examinee and an interlocutor. Instead, the examinees listened to the
pre-recorded instructions and prompts delivered by a computer and then they were asked to
record their responses. The six tasks and the test format for each task (e.g., preparation time,
response time) are found in Appendix A.
An analytic scoring rubric consisting of five rating scales (see Appendix B) was used
to score the examinees' recorded oral responses. The five scales included meaningfulness,
grammatical competence, discourse competence, task completion, and intelligibility. Each of
the five rating scales was rated on a six-point scale (0 for no control to 5 for excellent
control). To analyze each scale in relation to the different tasks in this study, the five scales
for each of the six tasks were regarded as individual items, making a total of 30 items (6 tasks
x 5 rating scales) on the test. That is, each cell in Table 1 represents an individual item of the
test. For instance, the item MeanT1 represents meaningfulness for Task 1 while the
item MeanT2 refers to meaningfulness for Task 2.
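
As an illustration of this item scheme, the following minimal Python sketch builds the 30 item labels by crossing the five rating scales with the six tasks. The abbreviations follow the labels used in Table 1; the function and variable names are hypothetical and are not part of the study's materials.

```python
from itertools import product

# Scale abbreviations as used in Table 1; names here are illustrative only.
SCALES = ["Mean", "Gram", "Disc", "Task", "Intel"]
TASKS = range(1, 7)  # Task 1 through Task 6


def item_names(scales=SCALES, tasks=TASKS):
    """Return one label per (rating scale, task) pair, e.g. 'MeanT1'."""
    return [f"{scale}T{task}" for scale, task in product(scales, tasks)]


items = item_names()
print(len(items))   # number of scale-by-task items
print(items[:6])    # the six meaningfulness items
```

Treating each scale-by-task cell as its own item is what later allows the same 30 scores to be grouped either by rating scale or by task.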


Table 1. Taxonomy of Items (Task x Rating Scale) on Speaking Ability
Rating scale              Number of Items  Task 1   Task 2   Task 3   Task 4   Task 5   Task 6
Meaningfulness            6                MeanT1   MeanT2   MeanT3   MeanT4   MeanT5   MeanT6
Grammatical competence    6                GramT1   GramT2   GramT3   GramT4   GramT5   GramT6
Discourse competence      6                DiscT1   DiscT2   DiscT3   DiscT4   DiscT5   DiscT6
Task completion           6                TaskT1   TaskT2   TaskT3   TaskT4   TaskT5   TaskT6
Intelligibility           6                IntelT1  IntelT2  IntelT3  IntelT4  IntelT5  IntelT6
Total                     30


Procedures
Test Administration
The speaking test was administered in a computer lab on the second day of a two-day
placement test administration. The test was administered to groups of approximately 40
students. Each student was seated in front of a computer. They listened to the test instructions
on a headset, read the instructions on the computer screen, and recorded their responses to the
test items using a microphone. Since all computers were controlled from a central console, the
examinees kept the same pace while taking the test. That is, the instructions and prompts were
delivered at the same time, and the preparation and response times were also provided to all
examinees at the same time.
Before the actual test began, the examinees were asked to fill in a background survey
which asked for demographic information, prior English-learning experience, and plans for
future study. Once all examinees of a group completed the survey, they were given a practice
task so that they would be familiar with the test format. After a short intermission for any
questions about the test format, the six tasks were played in sequence. For each task, the
examinees first listened to or looked at an instruction and a prompt. They were allowed to
prepare responses during a short preparation time and lastly they recorded their responses
during the given response time.

Scoring
Each examinee's performance was scored by two independent raters. The raters were
the CEP teachers, most of whom were MA or EdD students in the TESOL and applied
linguistics programs at Teachers College. Prior to the actual rating, the raters attended a
norming session in which the test tasks and the rubric were introduced and sample responses
were provided for practice. Time was also given for discussion of analytic scores so that the
raters had opportunities to monitor their decision-making processes by comparing the
rationale behind their scores with other raters' opinions. Rating practice and discussion
continued until the raters felt that they were well aware of the tasks and confident with
assigning scores on different rating scales. Following the norming session, each rater was
assigned a certain number of examinees. Since examinees' performance on each of the six
tasks was scored on the five rating scales, each examinee was given 30 analytic ratings on 30
items. The maximum score for each item was five and the minimum was zero. The scores
assigned by two independent raters were later averaged to determine a speaking score for the
placement test.

Analyses
The data were analyzed using SPSS version 12.0 (SPSS Inc., 2001) and EQS version
6.1 (Bentler & Wu, 2005). Descriptive statistics (i.e., means, standard deviations,
maximum/minimum raw scores, and skewness and kurtosis values) were calculated for the
entire test, for the 30 individual items, and for each of the five rating scales across the six
tasks separately using SPSS to verify central tendency and variability. Reliability estimates
were then calculated based on Cronbach's alpha to examine the degree of relatedness among
the 30 items and the six items under each of the five rating scales. Also, the degree of
agreement between the two raters (i.e., inter-rater reliability) was investigated from various
perspectives, such as from the examinees' total score, across the six tasks, and across the five
rating scales. Since composite scores comprised interval data that were converted from the
original ordinal data, inter-rater reliability was estimated based on Pearson Product-Moment
correlations.
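
For readers who wish to see the internal consistency computation spelled out, the following minimal Python sketch applies the standard formula for Cronbach's alpha, alpha = k/(k-1) x (1 - sum of item variances / variance of total scores). It is not the SPSS procedure used in the study; the function name and the small score matrix are made up for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row of k item scores per examinee)."""
    k = len(scores[0])
    # Sample variance of each item (column) across examinees.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each examinee's summed score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up scores for four examinees on three items (not data from the study).
toy_scores = [[1, 2, 2], [3, 3, 4], [4, 5, 4], [5, 5, 5]]
print(round(cronbach_alpha(toy_scores), 3))
```

Alpha approaches 1 when the items rise and fall together across examinees, which is the sense in which a high estimate is read as homogeneity among items.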
After calculating descriptive statistics and reliability estimates, exploratory factor
analyses (EFA) were conducted to determine the extent to which the 30 items clustered
together. In other words, factor analyses were used to examine what patterns of correlations
would be observed among the 30 items. Based on the correlation matrix, initial factors were
extracted by principal-axes factoring (PAF) after the appropriateness of the use of a
correlation matrix for factor analysis was verified using three calculations: (1) Bartlett's test
of sphericity; (2) the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy; and (3) the determinant of the correlation
matrix. The initial factors were then rotated until the best solution was found to determine the
number of underlying factors. Since it had been assumed that the factors were correlated with
one another, a direct oblimin rotation procedure was used after checking the factor correlation
matrices each time.
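
Two of the three adequacy checks named above can be sketched in a few lines. The Python example below computes the determinant of a correlation matrix and the Bartlett's test of sphericity statistic, using the standard formula chi-square = -(n - 1 - (2p + 5)/6) x ln|R| with p(p - 1)/2 degrees of freedom. This is an illustrative sketch, not the SPSS routine used in the study; the function names and the 3 x 3 matrix are hypothetical, and the KMO measure is omitted because it additionally requires the inverse of the matrix.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom).

    Requires a positive determinant; math.log raises ValueError otherwise.
    """
    p = len(corr)
    chi_square = -(n_obs - 1 - (2 * p + 5) / 6) * math.log(determinant(corr))
    df = p * (p - 1) // 2
    return chi_square, df


# Hypothetical correlation matrix for three items (not data from the study).
toy_corr = [[1.0, 0.8, 0.7], [0.8, 1.0, 0.75], [0.7, 0.75, 1.0]]
chi_square, df = bartlett_sphericity(toy_corr, n_obs=215)
print(round(determinant(toy_corr), 4), round(chi_square, 1), df)
```

A large chi-square relative to its degrees of freedom indicates that the correlation matrix differs from an identity matrix, while a positive determinant indicates that the matrix is not singular.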
Finally, confirmatory factor analyses (CFA) were performed to establish a model of
the speaking test. CFA was used to determine the extent to which the 30 items were measured
in relation to the six tasks and five scoring criteria. Based on a review of the literature, a
second-order Multitrait-Multimethod (MTMM) Model was first hypothesized. After failing to
find an appropriate solution with the hypothesized model, several other CFA models were
attempted to find a final model that best explained the data. To assess the adequacy of models
including the hypothesized model, several fit indices were used, such as the Chi-square
statistic, the Chi-square/df ratio, the comparative fit index (CFI), and the root mean square
error of approximation (RMSEA). In addition, the distribution of standardized residuals was
checked. The results of the Lagrange Multiplier test and the Wald test were analyzed for each run
in order to identify parameters that should be added to or could be dropped from a model. In the end, however,
a final speaking test model was chosen in accordance with substantive considerations while
taking into account the issue of parsimony. In the process of model evaluation, the ML Robust
method was used each time due to multivariate non-normality of the data.
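
The fit indices named above can be computed from chi-square values with commonly cited textbook formulas, as in the minimal Python sketch below. It is offered under stated assumptions: the formulas use n - 1 in the RMSEA denominator (some programs use n), they ignore the robust scaling corrections that EQS applies under the ML Robust method, and all chi-square values shown are hypothetical rather than results from the study.

```python
import math


def chi_square_df_ratio(chi2, df):
    """Chi-square divided by its degrees of freedom."""
    return chi2 / df


def rmsea(chi2, df, n_obs):
    """Root mean square error of approximation (n - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n_obs - 1)))


def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative fit index of a model relative to the null (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    return 1.0 if d_null == 0 else 1.0 - d_model / d_null


# Hypothetical chi-square values for a tested model and its null model.
print(round(chi_square_df_ratio(250.0, 200), 2))
print(round(rmsea(250.0, 200, n_obs=215), 3))
print(round(cfi(250.0, 200, 5000.0, 276), 3))
```

Both indices are driven by the amount by which chi-square exceeds its degrees of freedom: RMSEA scales that excess by model size and sample size, whereas CFI compares it with the corresponding excess for the null model.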

Results

Descriptive Statistics
The descriptive statistics which were calculated for the item level, the rating scale
level, and the entire 30-item test are presented in Table 2. The item-level means ranged from
2.64 to 3.41 and the standard deviations from 1.01 to 1.57. Although not very different, the
means of grammar-related items (i.e., GramT1 to GramT6) were lower than those for the
other groups of items. On the other hand, task completion-related items (i.e., TaskT1 to
TaskT6) showed relatively higher means compared to the other items. Grammar-related items had
the least variability (average Std.=1.04) while task completion-related items had the largest
variability (average Std.=1.23). With regard to the task-related aspect, Task 6 items (i.e.,
MeanT6, GramT6, DiscT6, TaskT6, and IntelT6) had the lowest means under each rating
scale. However, their standard deviations were greatest compared to those for the other task
items under the same rating scale. The skewness and kurtosis values, within the acceptable
range, indicated that all 30 items and five rating scales appeared to be approximately normally distributed.

Reliability Analyses
The reliability estimates for internal consistency were calculated for the five rating
scales and for the entire test (see Table 3). The reliability estimate for the entire test was very
high (0.991), signifying a high degree of homogeneity among the 30 items. Internal
consistency reliability for each rating scale also showed a high degree of consistency of the
six tasks under the five scales. The high reliability estimates, ranging from 0.936 to 0.963,
suggested that the six tasks measured the same construct with a high degree of consistency
within each rating scale.

Table 2. Descriptive Statistics (N=215, K=30)
Variable Minimum Maximum Mean Std. Skewness Kurtosis
1. Meaningfulness (Mean) 0 5.00 3.10 1.14 -.87 .30
MeanT1 0 5.00 3.12 1.28 -.81 .24
MeanT2 0 5.00 3.12 1.18 -.93 .65
MeanT3 0 5.00 3.20 1.11 -.82 .60
MeanT4 0 5.00 3.10 1.30 -.84 -.07
MeanT5 0 5.00 3.20 1.24 -.91 .27
MeanT6 0 5.00 2.85 1.38 -.66 -.45

2. Grammar (Gram) 0 4.58 2.87 1.04 -.95 .52
GramT1 0 5.00 2.83 1.15 -.84 .46
GramT2 0 4.50 2.87 1.04 -1.11 1.16
GramT3 0 4.50 2.92 1.01 -.92 .83
GramT4 0 5.00 2.93 1.20 -.97 .35
GramT5 0 5.00 2.93 1.12 -.92 .46
GramT6 0 4.50 2.73 1.27 -.79 -.29

3. Discourse Competence (Disc) 0 4.50 2.86 1.08 -.91 .33
DiscT1 0 5.00 2.83 1.21 -.78 .17
DiscT2 0 5.00 2.84 1.09 -.91 .59
DiscT3 0 5.00 2.95 1.05 -.83 .75
DiscT4 0 5.00 2.93 1.26 -.85 -.04
DiscT5 0 5.00 2.97 1.18 -.84 .17
DiscT6 0 5.00 2.64 1.32 -.63 -.42

4. Task Completion (Task) 0 5.00 3.18 1.23 -.86 .08
TaskT1 0 5.00 3.07 1.41 -.52 -.44
TaskT2 0 5.00 3.36 1.33 -1.00 .32
TaskT3 0 5.00 3.41 1.23 -.92 .42
TaskT4 0 5.00 3.04 1.57 -.56 -1.01
TaskT5 0 5.00 3.37 1.41 -.82 -.25
TaskT6 0 5.00 2.86 1.48 -.52 -.73

5. Intelligibility (Intel) 0 4.92 3.02 1.09 -.92 .53
IntelT1 0 5.00 2.99 1.20 -.89 .50
IntelT2 0 5.00 3.00 1.15 -.96 .73
IntelT3 0 5.00 3.09 1.05 -.93 .86
IntelT4 0 5.00 3.07 1.25 -.90 .20
IntelT5 0 5.00 3.10 1.16 -.91 .67
IntelT6 0 5.00 2.89 1.30 -.75 -.16

Total (30 items) 0 4.73 3.01 1.10 -.94 .43
Table 3. Reliability Estimates (N=215)
Construct                 Items Used          Nr of Items   Reliability Estimates
Meaningfulness            MeanT1 - MeanT6     6             0.960
Grammatical Competence    GramT1 - GramT6     6             0.963
Discourse Competence      DiscT1 - DiscT6     6             0.958
Task Completion           TaskT1 - TaskT6     6             0.936
Intelligibility           IntelT1 - IntelT6   6             0.963
Total                                         30            0.991


Although average scores by the two raters were used for the statistical analyses,
inter-rater reliability was calculated to determine the degree of agreement between the two raters.
The correlation between Rater 1 and Rater 2 was 0.837 for examinees' total score (see Table
4), 0.71 to 0.80 across the six tasks (see Table 5), and 0.78 to 0.82 across the five rating scales
(see Table 6). All correlations were significant at the alpha = 0.01 level, indicating that the
first rater's score on each task, each rating scale, and the entire test significantly correlated with
the second rater's score on the same task, rating scale, and entire test. As a result, it can be
assumed that the two raters scored the examinees' speaking with similar criteria in mind.
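
The inter-rater estimates reported here are Pearson product-moment correlations between the two raters' scores, and the scores used in the analyses are the averages of the two ratings. The following Python sketch illustrates both steps under the assumption that each rater's scores are held in a plain list; the function name and the five pairs of scores are made up and are not data from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up scores from two raters for five examinees (not data from the study).
rater1 = [2.0, 3.5, 4.0, 1.5, 3.0]
rater2 = [2.5, 3.0, 4.5, 2.0, 3.0]

# Score used for analysis: the mean of the two independent ratings.
averaged = [(a + b) / 2 for a, b in zip(rater1, rater2)]

print(round(pearson_r(rater1, rater2), 3))
print(averaged)
```

Note that a high correlation shows that the two raters rank examinees similarly; it does not by itself show that they assign the same absolute scores.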


Table 4. Inter-rater Reliability for the Entire Speaking Test (N = 215)
Rater 1 (TotR1) Rater 2 (TotR2)
Rater 1 (TotR1) 1.00 0.837**
Rater 2 (TotR2) 0.837** 1.00
**p < 0.01 (2-tailed), R1 = Rater 1, R2 = Rater 2


Table 5. Inter-rater Reliability across Six Tasks (N = 215)
Task     Rater pair     Correlation
Task 1   T1R1 - T1R2    0.80**
Task 2   T2R1 - T2R2    0.75**
Task 3   T3R1 - T3R2    0.71**
Task 4   T4R1 - T4R2    0.81**
Task 5   T5R1 - T5R2    0.80**
Task 6   T6R1 - T6R2    0.80**
**p < 0.01 (2-tailed), T1–T6: Task 1–Task 6; R1 = Rater 1, R2 = Rater 2

Table 6. Inter-rater Reliability across the Five Constructs (N = 215)
Construct                 Rater pair   Correlation
Meaningfulness            MR1 - MR2    0.78**
Grammatical Competence    GR1 - GR2    0.82**
Discourse Competence      DR1 - DR2    0.80**
Task Completion           TR1 - TR2    0.80**
Intelligibility           IR1 - IR2    0.81**
**p < 0.01 (2-tailed), M: Meaningfulness, G: Grammatical Competence, D: Discourse Competence,
T: Task Completion, I: Intelligibility, R1 = Rater 1, R2 = Rater 2


Results of Exploratory Factor Analysis
Once the appropriateness of the use of a correlation matrix for factor analysis was
verified (e.g., a significant Chi-square, the positive determinant of the correlation matrix), an
EFA was conducted as a preliminary step for a CFA in order to develop a factor structure for
the 30 observed variables. The initial factor extraction showed a very different result from the
hypothesized design of speaking ability, which assumed five underlying factors (i.e., five
rating scales). Two factors with eigenvalues greater than 1.0 were extracted, which accounted
for 83.7 percent of the variance. Variable communalities were all above 0.7, indicating that
the variances of the variables accounted for by the common factors were very high. The scree
plot also suggested the extraction of two factors. Since the number of factors obtained from
the initial extraction was quite different from the hypothesis set for the speaking test, solutions
with different numbers of factors were compared. The three-factor oblique rotation was the
best solution to achieve maximum parsimony (see Table 7). As observed in Table 7, the 30
items used to measure speaking ability clustered around the type of task. For instance, items
for Tasks 1, 2, and 3 loaded on Factor 1, items for Task 6 loaded on Factor 2, and items for
Tasks 4 and 5 loaded on Factor 3. To illustrate, all five items for Task 6 (i.e., MeanT6,
GramT6, DiscT6, TaskT6, and IntelT6) showed factor loadings above 0.3 for Factor 2.
Further analysis of the six tasks revealed a possible reason as to why the items
clustered around the task type factors rather than around the rating scales. Since Tasks 1, 2,
and 3 required examinees to speak with minimal input, the factor on which the items for
these three tasks loaded was interpreted as a Speak factor. Contrary to Tasks 1, 2, and 3,
Tasks 4 and 5 first required examinees to listen to a long message and then respond or
summarize it. Thus, Factor 3, which included items for Tasks 4 and 5, was coded as a Listen
and Speak factor. While Task 6 was a summary task (as was Task 5), it appeared that Task 6
required examinees to have topical knowledge in the process of listening and summarizing a
message. That is, examinees' familiarity with the topic of the task could help them approach
the task easily. Whereas Task 5 was about a topic (an electric car) that might be more
commonly discussed in everyday life, the listening prompt provided in Task 6 was a lecture
with highly specified content (the Barbizon School). Thus, Factor 2 was coded as Listen and
Speak with Topical Knowledge. In sum, the items did not cluster around operationalized
constructs of speaking ability (i.e., rating scales), showing that examinees' speaking
performance was better explained according to the task type rather than to the hypothesized
five constructs of speaking ability. As a result, the two cross-loadings present (i.e., IntelT3
and GramT5) were not seen as problematic since grammar and intelligibility could be
involved in any task as long as factors were divided based on the task type. The final three-
factor solution is presented in Table 8.


Table 7. Pattern Matrix for Speaking Ability
Item Factor 1 Factor 2 Factor 3
DiscT2 1.015 .054 .171
GramT2 .944 .142 .163
TaskT2 .890 .017 .014
GramT1 .874 .027 -.037
IntelT1 .873 -.019 -.054
DiscT1 .856 -.006 -.061
IntelT2 .852 .139 .069
MeanT2 .847 .135 .037
MeanT1 .837 -.052 -.138
TaskT3 .795 -.090 -.138
GramT3 .764 -.020 -.207
DiscT3 .720 -.030 -.236
MeanT3 .702 .020 -.212
TaskT1 .670 .028 -.190
IntelT3 .585 .015 -.337
MeanT6 .011 .962 -.004
TaskT6 -.044 .933 -.063
DiscT6 .056 .918 -.014
GramT6 .094 .847 -.053
IntelT6 .033 .804 -.143
MeanT4 .076 -.004 -.897
GramT4 .108 .058 -.809
TaskT4 -.057 .125 -.791
IntelT4 .090 .120 -.769
DiscT4 .135 .100 -.741
TaskT5 .066 .253 -.634
IntelT5 .189 .205 -.584
MeanT5 .224 .182 -.573
DiscT5 .288 .133 -.564
GramT5 .319 .179 -.484
Extraction Method: Principal Axes Factoring.
Rotation Method: Oblimin with Kaiser Normalization.
a Rotation converged in 13 iterations.
Table 8. Revised Taxonomy of Speaking Ability (Based on Exploratory Factor Analysis)
Factors                                  Nr of Items   Items
Speak                                    15
  Task 1                                 5             MeanT1, GramT1, DiscT1, TaskT1, IntelT1
  Task 2                                 5             MeanT2, GramT2, DiscT2, TaskT2, IntelT2
  Task 3                                 5             MeanT3, GramT3, DiscT3, TaskT3, IntelT3
Listen & Speak with Topical Knowledge    5
  Task 6                                 5             MeanT6, GramT6, DiscT6, TaskT6, IntelT6
Listen & Speak                           10
  Task 4                                 5             MeanT4, GramT4, DiscT4, TaskT4, IntelT4
  Task 5                                 5             MeanT5, GramT5, DiscT5, TaskT5, IntelT5
Total                                    30


Results of Confirmatory Factor Analysis
Bachman (2002) argued that a language test should be designed taking task
characteristics into account as well as the construct definition of language ability in order to
achieve the intended purpose of the test. In an attempt to understand the speaking test of the
current study in terms of both aspects (i.e., construct definition and task characteristics), the
first MTMM model was hypothesized in which the 24 items loaded on both trait factors (i.e.,
the four rating scales) and method factors (i.e., the six tasks), while the four trait factors
loaded on a second-order factor, speaking ability (see Figure 2). The rating scale of task
completion was not included as a trait factor in the model since it was considered redundant in
relation to the other rating scales. As a result, six items related to task completion (i.e.,
TaskT1, TaskT2, TaskT3, TaskT4, TaskT5, TaskT6) were deleted for the analysis, making a
total of 24 observed variables. Moreover, correlations among six tasks were not established in
the first model because six different tasks were hypothesized to elicit different aspects of
speaking ability.
The model-fit evaluation of the hypothesized model indicated an excellent fit, with a very high CFI
(0.99) and a very low RMSEA (0.032 with the confidence interval [0.015, 0.044]). In terms
of fit indices, the model was ideal since a CFI above 0.95 and an RMSEA below 0.05 are
considered an indication of a well-fitting model (Byrne, 2006). However, the test results were
not reliable due to a condition code for a variance of factor error (Parameter: D2, D2) which
caused an improper solution (e.g., a factor loading greater than 1.0 for Grammatical
Competence). Such a condition code, which is a common occurrence with MTMM data,
might have occurred due to the complexity of model specification (Byrne, 2006). Thus, the
initially hypothesized model was rejected.
In order to respecify the first model, several attempts were made. First, it was tested
whether four first-order factors (i.e., four trait factors) would load on the second-order factor
(i.e., speaking ability) without any method factors (see Figure 3). The data did not fit the
model, which indicated problems similar to those of the first model (e.g., condition codes and
factor loadings above 1.0). Moreover, the model showed a very poor fit, with a CFI of 0.715
and an RMSEA of 0.162. The results confirmed a need for consideration of both construct (i.e.,
rating scales) and task to interpret test scores, since the model without the task factors did not
represent the data. In addition, based on the results of this model, it was decided that four trait
factors should be correlated instead of using a second-order factor.





Figure 2. The Hypothesized Second-Order MTMM Model of CEP Speaking Placement Test
Mean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence,
Intel: Intelligibility, T1–T6: Task 1–Task 6




Figure 3. The Second-order Model without Method Factors
Mean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence,
Intel: Intelligibility, T1–T6: Task 1–Task 6


Another attempt was made before deciding upon a final model. A model was tested
with two additional factors: Rating 1 and Rating 2. The model was run both with and without
the correlation between the two ratings. However, both models were unsuccessful, which
confirmed that the data were not explained with such models. Therefore, based on an
examination of several possible models, the final MTMM model was established with four
trait factors which were correlated with each other and six method factors (see Figure 4). This
final model was obtained after statistically testing two assumptions which were made in
advance. The first assumption, regarding the deletion of the task completion factor, was
confirmed since the inclusion of the task completion factor in the final model lowered the
overall fit of the data. To test the other assumption, related to a possible task effect, the final
MTMM model was also tested with correlations among the six method factors. Although the
overall fit increased, it showed very little improvement. Thus, it was concluded that the six
different tasks measured different aspects of speaking ability, and the correlations were not
included in the final model. Though all estimates were statistically significant, they were not
included in Figure 4 since they would not be legible with the overabundance of arrows (refer to
Table 10 for the estimates).
As shown in Figure 4, there were 24 dependent variables (i.e., 24 observed variables)
and 34 independent variables (i.e., 10 factors and 24 error terms). There were also 78
parameters (i.e., 48 factor loadings, 6 factor covariances, 24 error variances) and 34 fixed
nonzero parameters (i.e., 10 factor variances, 24 error regression paths). The structure of these
factors and variables as specified in the model was tested based on the covariance matrix.
Following the summary of the model, model identification was confirmed in the output.
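
The parameter count above can be reproduced by simple bookkeeping. The following is a minimal sketch, assuming that each observed variable loads on exactly one trait factor and one method factor, that only the four trait factors covary, and that all factor variances are fixed; the variable names are illustrative and are not taken from the EQS output.

```python
# Bookkeeping for the final MTMM model (illustrative names, not EQS syntax).
n_observed = 24   # observed variables (4 rating scales x 6 tasks)
n_traits = 4      # trait factors (rating scales)
n_methods = 6     # method factors (tasks)

# Unique elements of the 24 x 24 sample covariance matrix.
data_points = n_observed * (n_observed + 1) // 2

# Free parameters, under the assumptions stated in the lead-in.
factor_loadings = n_observed * 2                      # one trait + one method loading per item
trait_covariances = n_traits * (n_traits - 1) // 2    # only trait factors covary
error_variances = n_observed
free_parameters = factor_loadings + trait_covariances + error_variances

# Degrees of freedom for the hypothesized and independence models.
df_model = data_points - free_parameters
df_independence = data_points - n_observed            # independence model: variances only

print(free_parameters, df_model, df_independence)
```

Under these assumptions the count comes to 78 free parameters, and the resulting degrees of freedom (222 for the hypothesized model, 276 for the independence model) agree with the values printed in Table 9.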
The model was first assessed as a whole. In terms of residuals, off-diagonal elements
were examined since they play a major role in the Chi-square statistic. The
standardized residual values were evenly distributed, and the average off-diagonal absolute
standardized residual was also quite small, at 0.0156. In addition, the distribution of
standardized residuals was symmetric and centered around zero. As a result, it was found that
very little discrepancy existed between Σ(θ) (the covariance matrix implied by the specified
structure of the hypothesized model) and S (the sample covariance matrix of observed variable
scores). With regard to the goodness-of-fit statistics, the independence Chi-square statistic was
5189.140 with 276 degrees of freedom. Although the Chi-square/df ratio was much greater
than 2, implying a poor model-data fit, it was ignored due to the sensitivity of Chi-square to
sample size. Instead, fit indices were used for further model-fit evaluation (see Table 9).

Table 9. EQS Output Goodness of Fit Statistics
GOODNESS OF FIT SUMMARY FOR METHOD = ROBUST

ROBUST INDEPENDENCE MODEL CHI-SQUARE = 5189.140 ON 276 DEGREES OF FREEDOM
INDEPENDENCE AIC = 4637.140 INDEPENDENCE CAIC = 3430.844
MODEL AIC = -179.250 MODEL CAIC = -1149.532

SATORRA-BENTLER SCALED CHI-SQUARE = 264.7500 ON 222 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.02603

FIT INDICES

BENTLER-BONETT NORMED FIT INDEX = 0.949
BENTLER-BONETT NON-NORMED FIT INDEX = 0.989
COMPARATIVE FIT INDEX (CFI) = 0.991
BOLLEN'S (IFI) FIT INDEX = 0.991
MCDONALD'S (MFI) FIT INDEX = 0.905
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.030
90% CONFIDENCE INTERVAL OF RMSEA (0.011, 0.043)
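
The summary statistics in Table 9 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the AIC values are reported in the "Chi-square minus twice the degrees of freedom" form; this is not stated in the output itself, but it reproduces the printed values. It also places the Chi-square/df ratio of the independence model next to that of the Satorra-Bentler scaled statistic. Variable names are illustrative.

```python
# Values copied from Table 9.
chi2_indep, df_indep = 5189.140, 276   # robust independence model
chi2_model, df_model = 264.750, 222    # Satorra-Bentler scaled Chi-square

# Chi-square/df ratios for the two models.
ratio_indep = chi2_indep / df_indep
ratio_model = chi2_model / df_model

# AIC, assuming the "Chi-square minus 2 * df" form (assumption, see lead-in).
aic_indep = chi2_indep - 2 * df_indep
aic_model = chi2_model - 2 * df_model

print(f"Chi-square/df, independence model: {ratio_indep:.2f}")
print(f"Chi-square/df, hypothesized model: {ratio_model:.2f}")
print(f"Independence AIC: {aic_indep:.3f}")
print(f"Model AIC: {aic_model:.3f}")
```

With the Table 9 values, the ratio is about 18.8 for the independence model and about 1.19 for the scaled statistic of the hypothesized model, and the two AIC values match those printed in the output.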




Figure 4. The Final MTMM Model
Mean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence,
Intel: Intelligibility, T1–T6: Task 1–Task 6; F1–F10: Factor 1–Factor 10; V2–V31:
Observed Variables 2–31
As shown in Table 9, the CFI was 0.991 and the RMSEA was 0.030, with a 90% confidence
interval of [0.011, 0.043], both of which indicated an excellent fit. The final indicator of overall
model fit was the number of iterations. According to the iterative summary in the output, only
five iterations were needed to reach convergence, which meant that the data fit the model
relatively easily. Thus, the analyses of residuals and fit indices revealed that the data for the
24 observed variables fit the 10-factor MTMM model well as a whole.
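
The incremental fit indices in Table 9 can be recovered from the two Chi-square statistics alone. The sketch below uses the standard textbook definitions of the NFI, NNFI, CFI, and IFI; that EQS applied exactly these formulas to the robust statistics is an assumption, although the results agree with the printed values to three decimals. Function and variable names are illustrative.

```python
def incremental_fit_indices(chi2_m, df_m, chi2_b, df_b):
    """Return (NFI, NNFI, CFI, IFI) from model (m) and baseline (b) Chi-squares."""
    nfi = (chi2_b - chi2_m) / chi2_b
    nnfi = (chi2_b / df_b - chi2_m / df_m) / (chi2_b / df_b - 1)
    # CFI compares noncentrality estimates, each truncated at zero.
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m, 0.0)
    cfi = 1.0 - d_m / d_b
    ifi = (chi2_b - chi2_m) / (chi2_b - df_m)
    return nfi, nnfi, cfi, ifi


# Values copied from Table 9.
nfi, nnfi, cfi, ifi = incremental_fit_indices(
    chi2_m=264.750, df_m=222, chi2_b=5189.140, df_b=276
)
print(f"NFI={nfi:.3f} NNFI={nnfi:.3f} CFI={cfi:.3f} IFI={ifi:.3f}")
```

Because all four indices depend on the baseline Chi-square, a very large independence-model statistic (here 5189.140) leaves little room for the hypothesized model's small excess over its degrees of freedom to lower them.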
After confirming the good fit of the model as a whole, the fit of individual parameters
was also assessed. The statistical significance of parameter estimates was first checked based
on the unstandardized estimates. All parameter estimates were statistically significant.
Therefore, all parameters could be considered important to the model, and none of the
parameters needed to be deleted from the model. Following the unstandardized estimates, a
standardized solution was considered (see Table 10).
As shown in Table 10, the trait factor loadings (i.e., F1 to F4), ranging
from 0.849 to 0.926, were much higher than the method factor loadings (i.e., F5 to F10),
ranging from 0.265 to 0.466. This signified that the four traits (i.e., rating scales) were much
stronger indicators than the six tasks, although both needed to be considered. Since the
regression coefficients of errors were quite small, ranging from 0.207 to 0.309, it can be
concluded that the contribution of errors to the variables was low and the variables were mainly
explained by the factors. The very high R-squared values, which refer to the proportion of
variance in each item accounted for by its related factors, confirmed that the model explained
all 24 items fairly well. Moreover, as assumed above, correlations between the trait factors
were quite high, at around 0.98. The four factors were all operationalized constructs of a single
construct of speaking ability. However, extremely high correlations were not considered ideal
for analytic scoring since they indicated that the four rating scales were almost
indistinguishable.
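
To make the size of these coefficients concrete, the loading ranges quoted above can be converted into proportions of variance. This is a rough sketch, assuming a standardized solution in which each item's trait factor, method factor, and error term are mutually uncorrelated, so that squared coefficients are variance shares and R-squared equals one minus the squared error coefficient. The endpoints are extremes across items and do not describe any single item.

```python
# Ranges quoted in the text (standardized coefficients).
trait_range = (0.849, 0.926)
method_range = (0.265, 0.466)
error_range = (0.207, 0.309)


def share(coefficient):
    """Proportion of item variance tied to a standardized coefficient."""
    return coefficient ** 2


trait_share = tuple(round(share(c), 3) for c in trait_range)
method_share = tuple(round(share(c), 3) for c in method_range)

# R-squared implied by the error coefficients: larger error -> smaller R-squared.
r2_low = round(1 - share(error_range[1]), 3)
r2_high = round(1 - share(error_range[0]), 3)

print("trait variance share:", trait_share)
print("method variance share:", method_share)
print("implied R-squared range:", (r2_low, r2_high))
```

Under these assumptions, the traits account for roughly 72% to 86% of item variance and the tasks for roughly 7% to 22%, which is the sense in which the tasks play a smaller but non-negligible role.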

Discussion and Conclusion

The present study examined the underlying structure of the CEP speaking placement
test based on a confirmatory factor analysis. The analysis was conducted with four trait
factors (i.e., meaningfulness, grammatical competence, discourse competence, and
intelligibility) and six method factors (i.e., Tasks 1 to 6). Also, the four trait factors were
correlated with one another. Although these four traits were assumed to be related by virtue of
being aspects of the same ability, correlations over 0.90 were unexpected. These high
correlations may indicate that speaking ability cannot be separated into several analytic
aspects, or the raters failed to understand and differentiate among the analytic scoring criteria.
For example, raters may have given similar scores to the four rating scales of each task based
on their own impression rather than going over the different criteria carefully, or they may not
have been accustomed to the different criteria because of the short norming period. Further
research on raters' rating processes may be required to explain the relationship among these
components of speaking ability.



Table 10. EQS Output Standardized Solution
STANDARDIZED SOLUTION:                                              R-SQUARED

MEANT1  =V2  = .888*F1 + .379*F5  + .259 E2                            .933
GRAMT1  =V3  = .898*F2 + .348*F5  + .269 E3                            .928
DISCT1  =V4  = .876*F3 + .391*F5  + .282 E4                            .921
INTELT1 =V6  = .880*F4 + .371*F5  + .295 E6                            .913
MEANT2  =V7  = .915*F1 + .304*F6  + .266 E7                            .929
GRAMT2  =V8  = .891*F2 + .368*F6  + .266 E8                            .929
DISCT2  =V9  = .879*F3 + .362*F6  + .309 E9                            .904
INTELT2 =V11 = .903*F4 + .312*F6  + .297 E11                           .912
MEANT3  =V12 = .906*F1 + .297*F7  + .300 E12                           .910
GRAMT3  =V13 = .922*F2 + .271*F7  + .278 E13                           .923
DISCT3  =V14 = .900*F3 + .348*F7  + .261 E14                           .932
INTELT3 =V16 = .916*F4 + .265*F7  + .302 E16                           .909
MEANT4  =V17 = .903*F1 + .370*F8  + .221 E17                           .951
GRAMT4  =V18 = .907*F2 + .331*F8  + .258 E18                           .933
DISCT4  =V19 = .909*F3 + .348*F8  + .231 E19                           .947
INTELT4 =V21 = .921*F4 + .324*F8  + .216 E21                           .953
MEANT5  =V22 = .921*F1 + .329*F9  + .209 E22                           .956
GRAMT5  =V23 = .924*F2 + .25[one digit illegible]*F9 + .284 E23        .919
DISCT5  =V24 = .926*F3 + .296*F9  + .234 E24                           .945
INTELT5 =V26 = .926*F4 + .293*F9  + .237 E26                           .944
MEANT6  =V27 = .849*F1 + .466*F10 + .248 E27                           .939
GRAMT6  =V28 = .880*F2 + .422*F10 + .218 E28                           .952
DISCT6  =V29 = .86 [one digit illegible]*F3 + .44 [one digit illegible]*F10 + .207 E29   .957
INTELT6 =V31 = .876*F4 + .394*F10 + .276 E31                           .924

CORRELATIONS AMONG INDEPENDENT VARIABLES

F2,F1    .992*
F3,F1    .98 [one digit illegible]*
F4,F1    .91 [one digit illegible]*
F3,F2    .996*
F4,F2    .98 [one digit illegible]*
F4,F3    .91 [one digit illegible]*

The final MTMM model explained the current test data very well, as evidenced by the
high fit indices. In particular, the four operationalized constructs (i.e., four rating scales)
primarily explained the data, with higher factor loadings than the six tasks. In other words,
examinees' performance on the test was mainly explained by the four constructs of speaking
ability; however, the characteristics of the six tasks had a non-negligible effect on the
examinees' performance. Therefore, the results of the current study empirically supported the
interactionalist perspective in which examinees' speaking ability is determined in terms of
both constructs (traits) and task characteristics of the test.
Although the current study contributes to the recent discussion concerning the
importance of both construct definitions and test task characteristics in L2 performance
assessments, it has a number of limitations. First, due to a limited sample size, it was not
possible to include a rating factor as part of the underlying structure of the speaking test,
although multiple ratings were available for all examinees' responses. It has been argued that
raters are one of the factors that affect examinees' performance (Kenyon, 1992; Linacre,
1989; McNamara, 1995, 1996, 1997). Indeed, previous studies on raters, which analyzed
raters' rating behaviors both quantitatively and qualitatively, showed rater effects on
performance assessments (e.g., Bonk & Ockey, 2003; Brown, 2005; Chalhoub-Deville, 1995;
Eckes, 2005; Meiron & Schick, 2000; Orr, 2002). Therefore, inclusion of a rater/rating factor
might change the underlying structure of the speaking test.
The other limitation is that structural equation modeling is a data-specific statistical
tool. In other words, the results of the current analyses cannot be generalized to other CEP
speaking data which include different participants. Likewise, other data sets might be
explained with different factors or different factorial structures. Therefore, in order to
generalize the structure of the CEP speaking placement test, repeated analyses of test data with
a larger sample size are required across different test administrations. Only then can the nature
of the CEP speaking placement test be understood and, ultimately, can inferences made on
examinees' speaking ability be considered reliable.


Acknowledgements

I would like to express my appreciation to the English Language Institute at the
University of Michigan for giving me an opportunity to perform this research. I am also very
grateful to Professor James Purpura and my colleagues at Teachers College, for their
insightful comments and suggestions throughout this study.

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford
University Press.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment.
Language Testing, 19(4), 453–476.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and
developing useful language tests. Oxford: Oxford University Press.
Barnwell, D. (1989). Naive native speakers and judgments of oral proficiency in Spanish.
Language Testing, 6(2), 152–163.
Bentler, P. M., & Wu, E. (2005). EQS 6.1 for Windows user's guide. Encino, CA: Multivariate
Software, Inc.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language
group oral discussion task. Language Testing, 20(1), 89–110.
Brindley, G. (1994). Task-centred assessment in language learning: The promise and the
challenge. In N. Bird, P. Falvey, A. Tsui, D. Allison, & A. McNeill (Eds.), Language
and learning: Papers presented at the Annual International Language in Education
Conference (Hong Kong, 1993) (pp. 73–94). Hong Kong: Hong Kong Education
Department.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific
language performance test. Language Testing, 12(1), 1–15.
Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency.
Language Testing, 20(1), 1–25.
Brown, A. (2005). Interviewer variability in oral proficiency interviews. Frankfurt, Germany:
Peter Lang.
Brown, J. D., Hudson, T., Norris, J. M., & Bonk, W. (2002). An investigation of second
language task-based performance assessments. Honolulu: University of Hawaii Press.
Byrne, B. M. (2006). Structural equation modeling with EQS. Mahwah, NJ: Lawrence
Erlbaum Associates, Inc.
Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller, Jr. (Ed.),
Issues in language testing research (pp. 333–342). Rowley, MA: Newbury House.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second
language teaching and testing. Applied Linguistics, 1(1), 1–47.
Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater
groups. Language Testing, 12(1), 16–33.
Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F.
Bachman & A. D. Cohen (Eds.), Interfaces between second language acquisition and
language testing research (pp. 32–70). Cambridge: Cambridge University Press.
Clark, J. L. D. (1975). Theoretical and technical considerations in oral proficiency testing. In
R. L. Jones & B. Spolsky (Eds.), Testing language proficiency (pp. 10–28). Arlington,
VA: Center for Applied Linguistics.
Clark, J. L. D. (1988). Validation of a tape-mediated ACTFL/ILR-scale based test of Chinese
speaking proficiency. Language Testing, 5(2), 187–205.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance
assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3),
197–221.
Elder, C. (1993). How do subject specialists construe classroom language proficiency?
Language Testing, 10(3), 235–254.
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency
tasks: What does the test-taker have to offer? Language Testing, 19(4), 347–368.
Farris, C. S. (1995). A semiotic analysis of sajiao as a gender marked communication style in
Chinese. In M. Johnson & F. Y. L. Chiu (Eds.), Unbound Taiwan: Close-ups from a
distance. Selected Papers Vol. 8 (pp. 1–29). Chicago: Center for East Asian Studies,
University of Chicago.
Fulcher, G. (2003). Testing second language speaking. London: Longman.
Jones, R. L. (1985). Second language performance testing: An overview. In P. C. Hauptman,
R. LeBlanc, & M. B. Wesche (Eds.), Second language performance testing (pp. 15–24).
Ottawa: University of Ottawa Press.
Kenyon, D. M. (1992). Introductory remarks at symposium on Development and use of rating
scales in language testing, 14th Language Testing Research Colloquium, Vancouver,
February 27th–March 1st.
Kim, M. (2001). Detecting DIF across the different language groups in a speaking test.
Language Testing, 18(1), 89–114.
Kim, Y. (2009). An investigation into native and non-native teachers' judgments of oral
English performance: A mixed methods approach. Language Testing, 26(2), 187–217.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Lumley, T. (1998). Perceptions of language-trained raters and occupational experts in a test of
occupational English language proficiency. English for Specific Purposes, 17, 347–67.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for
training. Language Testing, 12(1), 54–71.
Lumley, T., & O'Sullivan, B. (2005). The effect of test-taker gender, audience and topic on
task performance in tape-mediated assessment of speaking. Language Testing, 22(4),
415–437.
Luoma, S. (2004). Assessing speaking. Cambridge: Cambridge University Press.
Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch
measurement in the development of performance assessments of the ESL speaking skills
of immigrants. Language Testing, 15(2), 158–180.
Malabonga, V., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation and
response time on a computerized oral proficiency test. Language Testing, 22(1), 59–92.
McNamara, T. F. (1995). Modelling performance: Opening Pandora's box. Applied
Linguistics, 16(2), 159–179.
McNamara, T. F. (1996). Measuring second language performance. London: Longman.
McNamara, T. F. (1997). Interaction in second language performance assessment: Whose
performance? Applied Linguistics, 18(4), 446–466.
Meiron, B., & Schick, L. (2000). Ratings, raters and test performance: An exploratory study.
In A. J. Kunnan (Ed.), Fairness and validation in language assessment. Selected papers
from the 19th Language Testing Research Colloquium, Orlando, Florida (pp. 60–81).
Cambridge: Cambridge University Press.
Messick, S. (1994). The interplay of evidence and consequences in the validation of
performance assessments. Educational Researcher, 23(2), 13–23.
Morrow, K. (1979). Communicative language testing: Revolution or evolution? In C. J.
Brumfit & K. Johnson (Eds.), The communicative approach to language teaching (pp.
143–157). Oxford: Oxford University Press.
Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998). Designing second language
performance assessments (Technical Report No. 18). Honolulu: University of Hawaii,
Second Language Teaching & Curriculum Center.
O'Loughlin, K. K. (2002). The impact of gender in oral proficiency testing. Language Testing,
19(2), 169–192.
Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores.
System, 30, 143–154.
Savignon, S. J. (1972). Communicative competence: An experiment in foreign language
teaching. Philadelphia: The Center for Curriculum Development.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests. Language Testing,
11(2), 99–123.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University
Press.
SPSS Inc. (2001). SPSS Base 12.0 for Windows [Computer software]. Chicago, IL: SPSS Inc.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in
assessing oral interaction. Language Testing, 10(3), 305–335.
Wigglesworth, G. (1997). An investigation of planning time and proficiency level on oral test
discourse. Language Testing, 14(1), 85–106.

Appendix A. Speaking Test Tasks

Task 1. Catering service
In this task, you need to complain about something. Imagine you have ordered food from
Party Planners Inc. for your boss's birthday party. But there was not enough food and it was
delivered late. You spent a week planning the party, but it was ruined because of the food.
You were extremely upset that it happened. Call the caterer to complain about it. You have 20
seconds to plan.

Prompt (Audio)
[phone ringing] (Answering Machine) Hi! You've reached Party Planners Inc. We're sorry,
but we're not available to take your call right now. Please leave a detailed message after the
beep, and we'll get back to you as soon as possible. [Beep]
Test-taker: (45 sec response time)


Task 2. Favorite movie
In this task, you will be asked to talk about a movie. Think about a movie that you liked and
tell your friend about it. You have 20 seconds to plan.

Prompt (Video)
Your friend: So, what was that movie you liked? What is it about?
Test-taker: (60 sec response time)


Task 3. Fly in soup
In this task, you need to tell the story in the pictures. Look at the pictures (Pictures are shown
on the screen). Imagine this happened yesterday while you were having dinner at the next
table. Tell your friend what you saw. You have 60 seconds to plan your response.

Prompt (Video)
Your friend: So, what happened last night at the restaurant?
Test-taker: (60 sec response time)

Task 4. Moving out
In this task, you need to refuse a request. Imagine you are renting an apartment from a nice
old couple in New York City. You have been living there for over a year. Now, listen to a
telephone message from the couple.
Hi, this is Mary, your landlady. Tom and I have been trying to contact you, but you never seem
to be home. I guess you're really busy these days. Anyway... well, I don't know how to say this,
but... our granddaughter is moving to the City next month. She's gonna study at Columbia... and,
as you know, living in the city is expensive, and the rents are really high. So, she asked us if she
could live in the apartment you have now. I know we just renewed your lease, and we have no
right to ask you to move out, and, we really like you, too. But, do you think you can possibly
look for a different apartment? We're really sorry about this, but we have to do this for our
granddaughter. Since there's not much time, we'd like to hear from you as soon as possible, so
we can let our granddaughter know too. Again, we're sorry... Call and let us know, ok? Thanks.
(162 words)

(Q) Politely tell your landlady that you can't move out and explain why. You have 30 seconds
to plan.

Prompt (Audio)
Landlady: Hi. Come on in. Did you get our message? Have you thought about moving out?
Test-taker: (45 sec response time)


Task 5. Electric cars
In this task, you will be asked to summarize a radio commentary for a friend. Imagine your
friend, Jim, is thinking about buying an electric car. Now, listen to the radio commentary.
(Host of the radio commentary) Today, we're talking about electric cars. As you're well
aware, the conventional cars we drive every day... use a lot of gasoline. You know, how the
price of gasoline is going up... and more importantly, there's the issue of global warming...
these cars release harmful pollutants, like carbon monoxide. So, in reaction to this, engineers
have been working on cars that run on electric batteries, so let's hear about the current state
of the technology. We have a pre-recorded commentary by Ben Smith from General Autos.
Well, despite high expectations, the first generation of electric cars turned out to be a
complete failure. Why? The first problem is the battery... I mean, current battery technology is
still very limited. So electric cars can only travel a short distance before its battery needs
recharging. What this means is you can't make long trips without worrying about the battery
running out. They're only good for short trips like going to the supermarket or picking up the
kids from school. And when you turn the air conditioner or the radio on, the battery is used up
even quicker.
Then, you might say, we can just recharge the battery when it's used up.
Well... there's a serious problem with recharging, too. To recharge a battery, we need an
electric outlet, right? But there aren't many charging stations... which means, the driver might
get stuck... without being able to find a charging station nearby. Well, it gets even more
frustrating. Even if you can find a station, it takes up to 3 hours to fully recharge a battery. It's
way too long. Well, with these many limitations, does it make sense that anyone would want
to buy an electric car, even if it is environmentally friendly?
(Q) Summarize what you heard on the radio for Jim. Be sure to include two main problems
with electric cars. You have 30 seconds to plan.

Prompt (Video)
Jim: Did I tell you I'm thinking about buying an electric car?
Test-taker: (60 sec response time)


Task 6. Barbizon school
In this task, you will be asked to summarize a lecture for a classmate. Imagine your classmate,
Jennifer, missed today's lecture about the Barbizon school. Now, listen to the lecture.
Today, we'll talk about a group of artists, called the Barbizon School. The Barbizon School is
a group of French artists, who lived in the French town, Barbizon and who developed the
genre of landscape painting. So, what are their characteristics?

The Barbizon painters tried to find comfort in nature. I mean, they moved away from all the
commotion and disruption happening in, then, revolutionary Paris, and sought solace in
nature. And nature was the main theme of their paintings... they painted landscapes and
scenes of rural life as true to life as possible. And they rejected the idea of manipulating or
beautifying nature. Instead, they tried to achieve a true representation of the countryside. OK?

Second, in addition to the efforts to paint nature as realistically as possible, they also tried to
establish landscape as an independent, legitimate genre in France. Traditionally, landscape
painting wasn't appreciated as a separate genre, but only considered as a background. But
Barbizon artists reacted against this convention of classical landscape, and painted landscape
for its own sake. With their huge success and recognition, the painters of the Barbizon school
established landscape and themes of country life as vital subjects for French artists.

Now, let's look at an example... a painting by Rousseau. This one is called The Forest in
Winter at Sunset. [Show the painting on screen]. It shows the ancient forest near the village
of Barbizon. Rousseau is the best known member of the group. Each Barbizon painter had his
own style and specific interests, and Rousseau's vision was melancholic and sad. Can you feel
the depressing mood of the painting? At the top, a tangle of tree limbs, and birds flying into
the cloudy, dark, sunset sky. After the sun sets, the forest will be freezing cold. Rousseau
worked on this painting off-and-on for twenty years. He considered this his most important
painting and refused to sell it during his lifetime.

(Q) Summarize the lecture for Jennifer. Be sure to include two main characteristics of the
school and the example shown. You have 30 seconds to plan.

Prompt (Video)
Jennifer: So, what was the lecture about? What did I miss?
Test-taker: (60 sec response time)

Appendix B. Analytic Scoring Rubric

Meaningfulness (Communication Effectiveness)
Is the response meaningful and effectively communicated?

5 Excellent. The response:
  is completely meaningful; what the speaker wants to convey is completely clear and easy to understand.
  is fully elaborated.
  delivers sophisticated ideas.

4 Good. The response:
  is generally meaningful; in general, what the speaker wants to convey is clear and easy to understand.
  is well elaborated.
  delivers generally sophisticated ideas.

3 Adequate. The response:
  occasionally displays obscure points; however, main points are still conveyed.
  includes some elaboration.
  delivers somewhat simple ideas.

2 Fair. The response:
  often displays obscure points, leaving the listener confused.
  includes little elaboration.
  delivers simple ideas.

1 Limited. The response:
  is generally unclear and extremely hard to understand.
  is not well elaborated.
  delivers extremely simple, limited ideas.

0 No. The response:
  is incomprehensible.
  contains not enough evidence to evaluate.

Grammatical Competence: Accuracy, Complexity and Range

5 Excellent. The response:
  is grammatically accurate.
  displays a wide range of syntactic structures and lexical form.
  displays complex syntactic structures (relative clause, embedded clause, passive voice, etc.) and lexical form.

4 Good. The response:
  is generally grammatically accurate without any major errors (e.g., article usage, subject/verb agreement, etc.) that obscure meaning.
  displays a relatively wide range of syntactic structures and lexical form.
  displays relatively complex syntactic structures and lexical form.

3 Adequate. The response:
  rarely displays major errors that obscure meaning and a few minor errors (but what the speaker wants to say can be understood).
  displays a somewhat narrow range of syntactic structures; too many simple sentences.
  displays somewhat simple syntactic structures.
  displays use of somewhat simple or inaccurate lexical form.

2 Fair. The response:
  displays several major errors as well as frequent minor errors, causing confusion sometimes.
  displays a narrow range of syntactic structures, limited to simple sentences.
  displays use of simple and inaccurate lexical form.

1 Limited. The response:
  is almost always grammatically inaccurate, which causes difficulty in understanding what the speaker wants to say.
  displays lack of basic sentence structure knowledge.
  displays generally basic lexical form.

0 No. The response:
  displays no grammatical control.
  displays severely limited or no range and sophistication of grammatical structure and lexical form.
  contains not enough evidence to evaluate.

Grammatical Competence: Accuracy, Complexity and Range

5 Excellent   4 Good   3 Adequate   2 Fair   1 Limited   0 No

The response:

5  is grammatically accurate.
4  is generally grammatically accurate without any major errors (e.g., article usage, subject/verb agreement, etc.) that obscure meaning.
3  rarely displays major errors that obscure meaning and a few minor errors (but what the speaker wants to say can be understood).
2  displays several major errors as well as frequent minor errors, causing confusion sometimes.
1  is almost always grammatically inaccurate, which causes difficulty in understanding what the speaker wants to say.
0  displays no grammatical control.

5  displays a wide range of syntactic structures and lexical form.
4  displays a relatively wide range of syntactic structures and lexical form.
3  displays a somewhat narrow range of syntactic structures; too many simple sentences.
2  displays a narrow range of syntactic structures, limited to simple sentences.
1  displays lack of basic sentence structure knowledge.
0  displays severely limited or no range and sophistication of grammatical structure and lexical form.

5  displays complex syntactic structures (relative clause, embedded clause, passive voice, etc.) and lexical form.
4  displays relatively complex syntactic structures and lexical form.
3  displays somewhat simple syntactic structures.
   displays use of somewhat simple or inaccurate lexical form.
2  displays use of simple and inaccurate lexical form.
1  displays generally basic lexical form.
0  contains not enough evidence to evaluate.

Discourse Competence: Organization and Cohesion

5 Excellent   4 Good   3 Adequate   2 Fair   1 Limited   0 No

The response:

5  is completely coherent.
4  is generally coherent.
3  is occasionally incoherent.
2  is loosely organized, resulting in generally disjointed discourse.
1  is generally incoherent.
0  is incoherent.

5  is logically structured: logical openings and closures; logical development of ideas.
4  displays generally logical structure.
3  contains parts that display somewhat illogical or unclear organization; however, as a whole, it is in general logically structured.
2  often displays illogical or unclear organization, causing some confusion.
1  displays illogical or unclear organization, causing great confusion.
0  displays virtually non-existent organization.

5  displays smooth connection and transition of ideas by means of various cohesive devices (logical connectors, a controlling theme, repetition of key words, etc.).
4  displays good use of cohesive devices that generally connect ideas smoothly.
3  at times displays somewhat loose connection of ideas.
   displays use of simple cohesive devices.
2  displays repetitive use of simple cohesive devices; use of cohesive devices are not always effective.
1  displays attempts to use cohesive devices, but they are either quite mechanical or inaccurate leaving the listener confused.
0  contains not enough evidence to evaluate.

Task Completion
To what extent does the speaker complete the task?

5 Excellent   4 Good   3 Adequate   2 Fair   1 Limited   0 No

The response:

5  fully addresses the task.
4  addresses the task well.
3  adequately addresses the task.
2  insufficiently addresses the task.
1  barely addresses the task.
0  shows no understanding of the prompt.

5  displays completely accurate understanding of the prompt without any misunderstood points.
4  includes no noticeably misunderstood points.
3  includes minor misunderstanding(s) that does not interfere with task fulfillment.
2  displays some major incomprehension/misunderstanding(s) that interferes with successful task completion.
1  displays major incomprehension/misunderstanding(s) that interferes with addressing the task.
0  contains not enough evidence to evaluate.

5  completely covers all main points with complete details discussed in the prompt.
4  completely covers all main points with a good amount of details discussed in the prompt.

(e.g.,)
Electric Cars: two problems with the current technology (battery running out quickly and inconvenience in recharging)
Barbizon School: 2 characteristics of the school and one example (painted nature and established landscaping as an independent genre, and the Forest in the sunset as an example)
OR touches upon all main points, but leaves out details.
OR completely covers one (or two) main points with details, but leaves the rest out.
OR touches upon bits and pieces of the prompts.

Intelligibility
Pronunciation and prosodic features (intonation, rhythm, and pacing)

5 Excellent   4 Good   3 Adequate   2 Fair   1 Limited   0 No

The response:

5  is completely intelligible although accent may be there.
4  may include minor difficulties with pronunciation or intonation, but generally intelligible.
3  may lack intelligibility in places impeding communication.
2  often lacks intelligibility impeding communication.
1  generally lacks intelligibility.
0  completely lacks intelligibility.

5  is almost always clear, fluid and sustained.
4  is generally clear, fluid and sustained. Pace may vary at times.
3  exhibits some difficulties with pronunciation, intonation or pacing.
2  frequently exhibits problems with pronunciation, intonation or pacing.
1  is generally unclear, choppy, fragmented or telegraphic.
0  contains not enough evidence to evaluate.

does not require listener effort.
does not require listener effort much.
exhibits some fluidity.
may not be sustained at a consistent level throughout.
contains frequent pauses and hesitations.
may require some listener efforts at times.
may require significant listener effort at times.
contains consistent pronunciation and intonation problems.
requires considerable listener effort.
