
The First RWPT (lower form)

The first RWPT (Appendix B) was developed to address some of the shortcomings of the existing exam through increased task authenticity, an analytic scoring rubric for the writing subtest, and a more balanced proportion of reading and writing (R&W) tasks. I recruited nine IEP student volunteers to take the pilot test on October 7, 2001. The following is a brief discussion of the test results, with a focus on statistical analysis and validity, which sheds light on the revision steps taken in developing the current RWPT.

Reliability. Assuming the two halves of the test independently measure the test takers' skills, the Guttman split-half technique can be used to calculate the reliability coefficient (Bachman, 1990). To calculate internal consistency, I combined the reading items in Sections I and II. The first reason for this decision is that both sections use the same multiple-choice (MC) format; the second is that each section on its own is relatively short. The resulting coefficient is 0.76, which indicates that an estimated 76% of the observed score variance reflects the skills the test measures, while the remaining 24% is attributable to measurement error.

The writing tasks were graded by an R&W IEP teacher (Rater 1) and myself (Rater 2). A correlation coefficient calculated between the two sets of scores provides an estimate of the reliability of the judgments made by the two raters (Brown, 2005). To determine inter-rater reliability, I used the Microsoft Excel correlation function.

Table 1
Inter-rater Reliability among the Writing Tasks (N = 9)

              W Task 1 (Section I)     W Task 1 (Section II)    W Task 2 (Section II)
Student ID    Rater 1     Rater 2      Rater 1     Rater 2      Rater 1     Rater 2
1             5.5         5.5          5           5            5.5         5
2             6           5.5          5           5            5.5         5.5
3             6           5.5          5.5         5.5          5           5
4             6           6            6           5.5          5.5         5.5
5             5.5         5.5          6           5            6           5
6             5           3.5          4.5         4.5          5.5         5.5
7             6           4.5          5.5         5            6           5
8             5.5         4            4           3            N/A¹        N/A¹
9             3           4            4           3            4           4
              IR = .57                 IR = .88                 IR = .97

Note. W = writing; IR = inter-rater reliability.
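
For reference, the computations just described can be written out as follows. This is a sketch of the standard formulas rather than a record of the exact spreadsheet steps taken: the Guttman split-half estimate compares the variances of the two test halves with the variance of the total score, and the Excel correlation function returns the Pearson product-moment coefficient between the two raters' score sets.

\[
r_{\text{split-half}} = 2\left(1 - \frac{s_A^2 + s_B^2}{s_X^2}\right),
\qquad
r_{xy} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}
\]

Here, s_A^2 and s_B^2 are the score variances of the two halves, s_X^2 is the variance of the total score, and x_i and y_i are the two raters' scores for test taker i. Applied to the Rater 1 and Rater 2 scores for W Task 1 (Section I) in Table 1, the Pearson formula gives approximately .57, matching the IR value reported above.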

The inter-rater reliability for both Tasks 1 and 2 in Section II is much higher than for Task 1 in Section I. I set up a meeting with Rater 1 to find out what might account for our grading discrepancies, and to seek feedback based on his experience using the analytic rubric (Appendix C). We both realized that the grammar mistakes the test takers produced showed more variety and complexity than the rubric stipulates. We felt that the loose definition of grammar use, combined with the short writing samples the task elicited (under 50 words on average), may have created gaps in how writing errors were counted, which in turn contributed to the differences in our grades.
¹ As mentioned above, student 8 was not able to produce his writing for this task due to his failure to comprehend the reading prompt.

Another salient factor in the discrepancy may be the length of the writing. Most respondents provided a longer writing sample for Task 2 than for Task 1. Rater 1 and I acknowledged that it was easier to assign a level to an individual skill area when the writing involved more complexity. This may be because simpler writing samples (e.g., for Task 1) tend to contain fewer errors, which can leave the evaluators in the dilemma of basing their judgment on one or two minor mistakes.

Task 2 in Section II involves writing a three-to-five-sentence summary of a news story. Despite the short length of the responses, the reliability coefficient was rather high (.97). Most of the test takers used parts of the original sentences from the article to construct their summaries. Their responses appeared much more homogenized and standardized than those for the previous two tasks, which may have led to fewer grading disputes and, in turn, higher inter-rater reliability. However, this casts serious doubt on the task's ability to elicit real linguistic performance. This task apparently needs to be revised or replaced.

Item analysis. Item analysis was conducted to gain a clear picture of the performance of each individual MC item (Tables 2.1 and 2.2). Table 3 breaks the IF values into three groups to shed some light on whether the reading subtest offers a range of difficulty levels. To ensure such a range, a placement test should contain both easy items (IF values of 0.81-0.95) and difficult items (IF values of 0.35-0.59) at the two ends of the spectrum, while most items should show an IF value between 0.60 and 0.80 (Turner, 2011). Measured against this standard, the pilot test appears to need more difficult reading items in order to raise the 38% of items in the middle range and the 24% at the lower end of the IF distribution. In addition, the ID analysis provides a preliminary assessment of the discriminating ability of the MC items, which helps me decide which items to select and carry over into the revised version.

Table 2.1
Section I Item Facility and Item Discrimination (N = 9)

Item No.    Number of correct responses    IF      ID
1           8                              .89     .33
2           9                              1.00    0
3           5                              .56     .67
4           9                              1.00    0
5           4                              .46     0
6           9                              1.00    0
7           7                              .78     0
8           2                              .22     .33
9           9                              1.00    0
10          8                              .89     .33
11          7                              .78     .67
Average                                    .78     .21

Table 2.2
Section II Item Facility and Item Discrimination (N = 9)

Item No.    Number of correct responses    IF      ID
1           5                              .56     .33
2           6                              .67     .67
3           7                              .78     .33
4           8                              .89     0
5           7                              .78     .67
6           6                              .67     1
7           6                              .67     1
8           4                              .44     .33
9           8                              .89     0
10          4                              .67     .67
Average                                    .70     .50
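
For clarity, the IF and ID values in Tables 2.1 and 2.2 follow the usual classical item-analysis definitions. The size of the upper- and lower-scoring groups used for ID is not stated in the text; the reported values, all multiples of .33, are consistent with upper and lower groups of three test takers each, which is assumed in the sketch below.

\[
IF = \frac{\text{number of correct responses}}{N},
\qquad
ID = \frac{\text{correct responses in upper group} - \text{correct responses in lower group}}{\text{group size}}
\]

For example, Item 3 in Section I was answered correctly by 5 of the 9 test takers, so IF = 5/9 = .56; its ID of .67 would correspond to all three upper-group test takers but only one lower-group test taker answering it correctly, (3 - 1)/3 = .67.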

Table 3
Distribution of Lower, Middle, and Upper Ranges of IFs (N = 9)

IF value range    Lower (0.22-0.59)    Middle (0.60-0.80)    Upper (0.81-1.00)
Percentage        24%                  38%                   38%

The cut-off point for the lower range is set at 0.22, the lowest IF value observed in this pilot test.
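
As a quick arithmetic check against the IF values reported in Tables 2.1 and 2.2 (21 items in total), 5 items fall in the lower range while 8 items each fall in the middle and upper ranges, which matches the percentages in Table 3:

\[
\frac{5}{21} \approx 24\%, \qquad \frac{8}{21} \approx 38\%
\]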
