
Article

An Evaluation of Item Response Theory Classification Accuracy and Consistency Indices


Adam E. Wyse1 and Shiqi Hao2

Applied Psychological Measurement 36(7) 602–624
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621612451522
http://apm.sagepub.com

Abstract

This article introduces two new classification consistency indices that can be used when item response theory (IRT) models have been applied. The new indices are shown to be related to Rudner's classification accuracy index and Guo's classification accuracy index. The Rudner- and Guo-based classification accuracy and consistency indices are evaluated and compared with estimates from the more commonly applied IRT-recursive procedure using a simulation study and data from two large-scale assessments. Results from the simulation study and practical examples suggested that the Guo- and Rudner-based indices tended to produce estimates that were closer to the simulated values and exceeded those from the IRT-recursive-based procedure. However, results did suggest that the Rudner- and Guo-based indices can have some undesirable features that are important to keep in mind when applying them in practice. The values of the classification accuracy and consistency indices appeared to be affected by a number of factors, including the distribution of examinees, test length, the placement of the cut-scores, and the proficiency estimators applied to estimate examinee ability. Suggestions are made that an important part of investigations evaluating classification accuracy and consistency indices should be the creation of figures that show the value of the classification accuracy and classification consistency for individual examinees across the range of possible scores, as these figures can help provide indications of subtle and important differences between indices.

Keywords

classification consistency, classification accuracy, item response theory, cut-scores, θ metric, number-correct scores

1Michigan Department of Education, Arden Hills, MN, USA
2Michigan Department of Education, Lansing, USA

Corresponding Author:
Adam E. Wyse, Michigan Department of Education, Bureau of Assessment and Accountability, 1813 Chatham Ave., Arden Hills, MN 55112, USA
Email: WyseA@michigan.gov


One common use of test scores after they are computed is to compare the test scores with cut-scores to determine the level of performance that an examinee achieved on the assessment. Based on the scores that students receive, students are classified into different levels on the assessment, and decisions are made based on those classifications. A critical measurement concern when using cut-scores to make decisions is the classification accuracy and the classification consistency expected on the assessment. Classification consistency is the degree to which examinees would be classified into the same performance categories over parallel replications of the same assessment (Lee, 2010). Classification accuracy is the degree to which observed classifications would agree with true classifications assuming known cut-scores on a single assessment (Lee, 2010).

There are numerous procedures for computing classification accuracy and classification consistency. Procedures for computing classification accuracy and consistency have been discussed in Huynh (1976); Subkoviak (1976); Hanson and Brennan (1990); Livingston and Lewis (1995); Schulz, Kolen, and Nicewander (1999); Wang, Kolen, and Harris (2000); Rudner (2001, 2005); Lee, Hanson, and Brennan (2002); Brennan and Wan (2004); Guo (2006); Martineau (2007); Lee, Brennan, and Wan (2009); and Lee (2010). Lee (2010) provided an excellent summary of many of the approaches and gave empirical examples of how several of the procedures work for computing classification accuracy and classification consistency. Most of the procedures assume that scores are reported in the number-correct score metric and differ primarily in the models used in calculating the indices (e.g., beta-binomial model, application of the item response theory [IRT] recursive formula) and in whether an examinee distribution or each examinee score is considered in computing the classification accuracy and consistency indices. Lee refers to methods that use examinee distributions as distribution methods and methods that use examinee scores as person methods.

Rudner's (2001, 2005) classification accuracy index and Guo's (2006) classification accuracy index are somewhat different from the approaches discussed in Lee (2010) in that they can be applied to data that are scored in the IRT θ metric or a linear transformation of this metric. This characteristic of these two indices distinguishes them from the other methods. No approach for computing classification consistency currently exists when data are reported in the IRT θ metric or a linear transformation of this metric. Given that no such index has been formulated and research has not been conducted to compare Rudner's or Guo's approaches with the other more commonly used indices, such as the IRT-recursive procedure, which assumes that the reporting metric is number-correct scores, an important question is to what extent indices based on Rudner's or Guo's formulations differ from other more commonly used approaches.

The purposes of this article are to introduce two new IRT-based classification consistency indices, one that is an extension of Rudner's classification accuracy index and one that is an extension of Guo's classification accuracy index, and to explore how classification accuracy and consistency indices based on Rudner's and Guo's formulations and the IRT-recursive procedure perform in simulation and practice. In the next section of this article, Rudner's classification accuracy index is reviewed and the new index for computing classification consistency that is an extension of Rudner's index is introduced. This is followed by a discussion of Guo's classification accuracy index and the introduction of a new classification consistency index that is an extension of Guo's formulation.
These indices are then contrasted with the more commonly used approach for computing classification accuracy and consistency with IRT that uses the IRT-recursive formula discussed in Schulz et al. (1999), Wang et al. (2000), Lee et al. (2002), and Lee (2010). A simulation study is then provided to evaluate the performance of the different indices with various proficiency estimators, two different ability distributions, two different test lengths, and three different sets of cut-scores. Practical examples from two large-scale assessments then show the values of indices with various proficiency estimators in practical situations. The article concludes with discussion and some areas for future research.


Classification Accuracy and Consistency Indices

Rudner-Based Indices


Rudner's (2001, 2005) classification accuracy index uses three data vectors to compute classification accuracy (Martineau, 2007). The first vector is a vector of C + 1 cut-scores:

$$\mathbf{k} = \begin{bmatrix} k_1 & k_2 & \cdots & k_{C+1} \end{bmatrix}, \quad \text{where } k_1 < k_2 < \cdots < k_{C+1} \text{ and } k_1 = -\infty,\ k_{C+1} = \infty. \tag{1}$$

This vector of cut-scores contains the operational cut-scores on the assessment and the lower and upper bounds for all categories. For example, if there are three operational cut-scores, the vector in Equation 1 would contain the three operational cut-scores and positive and negative infinities. The second vector is the vector of estimated examinee scores, which can be represented as

$$\hat{\boldsymbol{\theta}} = \begin{bmatrix} \hat{\theta}_1 & \hat{\theta}_2 & \cdots & \hat{\theta}_{N_e} \end{bmatrix}', \tag{2}$$

where $N_e$ is the number of examinees and $\hat{\theta}_i$ is the IRT ability estimate for examinee i. The vector in Equation 2 contains each examinee's ability estimate and suggests that Rudner's index is a person method. The third vector is a vector of standard error estimates, which can be written as

$$\hat{\boldsymbol{\sigma}} = \begin{bmatrix} \hat{\sigma}_{\hat{\theta}_1} & \hat{\sigma}_{\hat{\theta}_2} & \cdots & \hat{\sigma}_{\hat{\theta}_{N_e}} \end{bmatrix}', \tag{3}$$

where $N_e$ is the number of examinees and $\hat{\sigma}_{\hat{\theta}_i}$ is an IRT standard error estimate for examinee i. The standard errors in Equation 3 can be computed from an individual's IRT test information function. In this case, the estimate of the standard error for an examinee is

$$\hat{\sigma}_{\hat{\theta}_i} = \frac{1}{\sqrt{I(\hat{\theta}_i)}}, \tag{4}$$

where $I(\hat{\theta}_i)$ is the value of the test information function for examinee i. One then finds the area between each successive pair of cut points assuming conditional normality of the standard error estimate around each examinee's ability estimate. The normality assumption comes from asymptotic theory and IRT assumptions when using maximum likelihood (ML) estimation, which imply that as the number of items and examinees becomes large, one should expect an examinee's ML estimate to converge to a normal distribution with a mean of θ and a standard deviation equal to the reciprocal of the square root of the individual's test information function. The expected probability of scoring in each performance-level category C based on these assumptions can be written as

$$\hat{p}_{iC} = \phi\left(k_C, k_{C+1}, \hat{\theta}_i, \hat{\sigma}_{\hat{\theta}_i}\right), \tag{5}$$

where $\phi(a, b, \mu, \sigma)$ is the area under a normal curve from a to b with a mean of $\mu$ and a standard deviation of $\sigma$, and the other terms have the same meanings as before. It is important to point out that although the assumptions underlying Equation 5 come from asymptotic theory and ML estimation, Equation 5 and the normal distribution assumption can be employed with any proficiency estimator.


One can then define an $N_e \times C$ matrix of expected probabilities, $\hat{\mathbf{P}}$, that contains the expected probabilities of each examinee falling into each performance-level category C. The expected probability that corresponds to the performance-level category that the examinee is classified into is assumed to be the expected probability of correct classification, and the other probabilities are assumed to be the expected misclassification probabilities. Define an $N_e \times C$ matrix of weights, $\mathbf{W}$, which is used to flag the performance-level category that the examinee obtained on the assessment, and write the matrix as

$$\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1C} \\ w_{21} & w_{22} & \cdots & w_{2C} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N_e 1} & w_{N_e 2} & \cdots & w_{N_e C} \end{bmatrix}, \tag{6}$$

where the weight $w_{ic}$ equals 1 if the examinee's score is classified into performance-level category C, and 0 otherwise.

Rudner's expected classification accuracy index can be found by performing element-by-element multiplication of $\hat{\mathbf{P}}$ with $\mathbf{W}$, taking the sum of all the elements in the resultant matrix, and dividing by the number of examinees, $N_e$. Mathematically, the index can be written as

$$\hat{\tau} = \frac{\sum \left( \hat{\mathbf{P}} * \mathbf{W} \right)}{N_e}, \tag{7}$$

where * denotes element-by-element matrix multiplication. As classification accuracy can be found based on the administration of a single assessment, Equation 7 only contains the matrices $\hat{\mathbf{P}}$ and $\mathbf{W}$ and does not involve the product of $\hat{\mathbf{P}}$ with itself.

By comparison, classification consistency provides a measure of the proportion of examinees who would be classified into the same category on parallel replications of the same assessment. This involves taking the product of $\hat{\mathbf{P}}$ with itself and does not involve a matrix to flag the observed performance level of the examinee. The new classification consistency index can therefore be expressed as

$$\hat{\gamma} = \frac{\sum \left( \hat{\mathbf{P}} * \hat{\mathbf{P}} \right)}{N_e}, \tag{8}$$

where * again denotes element-by-element multiplication and $N_e$ again is the number of examinees. The index in Equation 8 almost seems trivial, as it should always be less than or equal to the expected classification accuracy given that Equation 8 involves squaring the elements of the $\hat{\mathbf{P}}$ matrix. Nonetheless, understanding the relationships between the indices and having a classification consistency index that can be computed when data are scored in the θ metric is practically useful given that such an index has not been formulated to this point, and that it is common practice to report classification accuracy and consistency following the administration of an assessment.
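To make these computations concrete, the following is a minimal R sketch of the Rudner-based accuracy and consistency indices. The computations in this article were carried out in R, but the code below is our illustration, not the authors' program; the function name and the example inputs are hypothetical.

```r
# Minimal sketch of the Rudner-based indices (Equations 5, 7, and 8).
# theta_hat: vector of ability estimates; se_hat: vector of standard errors
# from Equation 4; cuts: operational cut-scores in the theta metric.
rudner_indices <- function(theta_hat, se_hat, cuts) {
  k <- c(-Inf, cuts, Inf)                # Equation 1: bounded cut-score vector
  n_cat <- length(k) - 1
  Ne <- length(theta_hat)
  # P-hat: expected probability of each examinee falling in each category,
  # using a normal curve centered at theta_hat with SD se_hat (Equation 5)
  P <- sapply(seq_len(n_cat), function(c) {
    pnorm(k[c + 1], theta_hat, se_hat) - pnorm(k[c], theta_hat, se_hat)
  })
  # W: flag the category each examinee's estimate actually falls in
  obs <- findInterval(theta_hat, k)      # observed category index, 1..n_cat
  W <- matrix(0, Ne, n_cat)
  W[cbind(seq_len(Ne), obs)] <- 1
  accuracy    <- sum(P * W) / Ne         # Equation 7
  consistency <- sum(P * P) / Ne         # Equation 8
  c(accuracy = accuracy, consistency = consistency)
}

# Hypothetical example with three cut-scores and 2,000 examinees
set.seed(1)
theta_hat <- rnorm(2000)
se_hat    <- runif(2000, 0.25, 0.45)
rudner_indices(theta_hat, se_hat, cuts = c(-0.75, 0.00, 0.75))
```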

Guo's Indices
Guo's (2006) classification accuracy index was originally designed as an extension of Rudner's index in the context of ML estimation, and it can be loosely viewed as a person-based index. The index makes no assumption of normality of an examinee's standard error estimate around their ability estimate, and calculates expected classification probabilities and the $\hat{\mathbf{P}}$ matrix based on individual examinee likelihood functions from IRT models.


The avoidance of the normality assumption is an advantage of the method, as the normality assumption only holds asymptotically and never is completely satisfied in practice. For dichotomous items, the likelihood function can be written as

$$L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta) = \prod_{j=1}^{n} P_{ij}^{u_{ij}} Q_{ij}^{1 - u_{ij}}, \tag{9}$$

where i is the examinee, j is the item on the test, $u_{ij}$ is the response to item j by examinee i with 1 signaling a correct response and 0 signaling an incorrect response, $P_{ij}$ is the probability of a correct response to item j given θ, and $Q_{ij}$ is the probability of an incorrect response to item j given θ, which is computed as $1 - P_{ij}$. Similar likelihood functions can be written out for polytomous items and mixed-format tests. The expected probability of scoring in any particular category can be found using the likelihood functions as
$$\hat{p}_{ic} = \frac{\sum_{\theta = k_{c_i}}^{k_{c_i + 1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)}{\sum_{h=1}^{C} \sum_{\theta = k_h}^{k_{h+1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)}, \tag{10}$$

where $\sum_{\theta = k_{c_i}}^{k_{c_i + 1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)$ is the sum of likelihood function values from performance category C to the next higher performance category C + 1 for a set of equally spaced θ points between the cut-scores (e.g., 100 equally spaced points), and the denominator is the sum of the likelihood function values across all performance categories. The fact that Guo's method uses sets of equally spaced θ points between cut-scores suggests that Equation 2 should not be a vector, but rather an $N_e \times ((C + 1) \times N_P)$ matrix, where C is the number of performance categories and $N_P$ is the number of equally spaced points between cut-scores. This suggests that Guo's method is not a person method in the traditional sense of how person methods are conceptualized. It is also important to note that to be able to compute the expected probabilities for the highest and lowest categories, the highest and lowest cut-scores in Equation 1 need to be set at arbitrary high and low θ values, such as θ = 6 and −6. That is, the vector of cut-scores should be expressed as

$$\mathbf{k} = \begin{bmatrix} k_1 & k_2 & \cdots & k_{C+1} \end{bmatrix}, \quad \text{where } k_1 < k_2 < \cdots < k_{C+1} \text{ and } k_1 = -6,\ k_{C+1} = 6. \tag{11}$$

The use of arbitrary high and low values for the extreme cut-scores instead of positive and negative infinities is also a small difference between the Guo and Rudner approaches. The computations from Equation 10 can then be put into the $\hat{\mathbf{P}}$ matrix similar to the Rudner-based indices, and the weight matrix $\mathbf{W}$ can be formulated based on the examinee ability estimates and comparing those estimates with the cut-scores. The classification accuracy index and the new classification consistency index based on Guo's formulation can then be determined based on Equations 7 and 8 applied to these $\hat{\mathbf{P}}$ and $\mathbf{W}$ matrices. The relationships between the indices again are fairly clear, and it can be observed that classification consistency should be less than or equal to classification accuracy.
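A parallel R sketch of the Guo-based expected probabilities for a single examinee with dichotomous items follows. It assumes a 3PL response function with the 1.7 scaling constant and a 100-point grid per category; the function and parameter names are ours, not from the article.

```r
# Minimal sketch of the Guo-based expected probabilities (Equation 10)
# for one examinee with dichotomous 3PL items; a, b, g are item parameters
# and u is the examinee's 0/1 response vector.
guo_probs <- function(u, a, b, g, cuts, lo = -6, hi = 6, n_pts = 100) {
  k <- c(lo, cuts, hi)                      # Equation 11: bounded in theta
  n_cat <- length(k) - 1
  # likelihood of the response pattern at a single theta (Equation 9)
  lik <- function(theta) {
    p <- g + (1 - g) / (1 + exp(-1.7 * a * (theta - b)))  # assumed 3PL ICC
    prod(p^u * (1 - p)^(1 - u))
  }
  # sum likelihood values over an equally spaced grid within each category
  cat_sums <- sapply(seq_len(n_cat), function(c) {
    grid <- seq(k[c], k[c + 1], length.out = n_pts)
    sum(sapply(grid, lik))
  })
  cat_sums / sum(cat_sums)                  # normalized category probabilities
}

# Hypothetical 10-item example
set.seed(2)
a <- runif(10, 0.8, 2); b <- rnorm(10); g <- rep(0.2, 10)
u <- rbinom(10, 1, 0.6)
guo_probs(u, a, b, g, cuts = c(-0.75, 0.00, 0.75))
```

The resulting vector of category probabilities fills one row of the $\hat{\mathbf{P}}$ matrix; accuracy and consistency then follow from Equations 7 and 8 exactly as in the Rudner-based sketch.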

IRT-Recursive-Based Indices
Similar to the Rudner-based indices, one again starts with a vector of cut-scores when computing the IRT-recursive-based classification indices (Lee, 2010; Lee et al., 2002; Schulz et al., 1999; Wang et al., 2000).


However, the vector of cut-scores is expressed in the number-correct score metric instead of the θ metric. Denote this vector of cut-scores as

$$\mathbf{k} = \begin{bmatrix} k_1 & k_2 & \cdots & k_{C+1} \end{bmatrix}, \quad \text{where } k_1 < k_2 < \cdots < k_{C+1} \text{ and } k_1 = 0,\ k_{C+1} = m. \tag{12}$$

In Equation 12, m represents the maximum possible score on the assessment, and each cut-score is assumed to be determined by translating the θ cut-score to the number-correct score metric. It is important to recognize that in translating these cut-scores to the number-correct score metric, rounding is needed, as the cut-scores in the θ metric may not align perfectly with a particular number-correct score. As indices based on the IRT-recursive formula can also be computed as a person method, the computation of the indices also includes a vector of ability estimates, which is identical to Equation 2. It is also possible to compute the IRT-recursive-based indices using the quadrature points from a run of an IRT software program as a distribution method. However, using the quadrature points is designed to approximate the full vector of ability estimates, and hence, it makes sense to write the indices using the full vector of ability estimates. One uses these ability estimates to create a distribution of the probabilities of receiving each number-correct score using the IRT-recursive formula (Thissen, Pommerich, Billeaud, & Williams, 1995).

To write the IRT-recursive formula for dichotomous items, define $f_n(x \mid \hat{\theta}_i)$ as the conditional distribution of number-correct scores over the first n items for an examinee with ability $\hat{\theta}_i$, and $\hat{P}_{ij}$ as the probability of a correct response to item j by examinee i. Define $f_1(x = 0 \mid \hat{\theta}_i) = 1 - \hat{P}_{i1}$ as the probability of earning a score of zero for examinee i on the first item. For n > 1, the recursion formula can be written as follows:

$$f_n(x \mid \hat{\theta}_i) = \begin{cases} f_{n-1}(x \mid \hat{\theta}_i)\,(1 - \hat{P}_{in}) & x = 0 \\ f_{n-1}(x \mid \hat{\theta}_i)\,(1 - \hat{P}_{in}) + f_{n-1}(x - 1 \mid \hat{\theta}_i)\,\hat{P}_{in} & 0 < x < n \\ f_{n-1}(x - 1 \mid \hat{\theta}_i)\,\hat{P}_{in} & x = n. \end{cases} \tag{13}$$

This formula can also be extended to polytomous items and mixed-format tests, as is described in Thissen et al. (1995). Then, the probability of scoring in each performance category can be represented as

$$\hat{p}_{ic} = \sum_{x = k_C}^{k_{C+1}} f_n(X = x \mid \hat{\theta}). \tag{14}$$

Following similar logic as is used with the Rudner-based indices, one forms the matrices $\hat{\mathbf{P}}$ and $\mathbf{W}$ and computes classification accuracy and consistency using the formulations in Equations 7 and 8. Again, the relationships between the indices are fairly clear, and it can be observed that classification consistency should be less than or equal to classification accuracy.
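The recursion in Equation 13 can be sketched compactly in R. Here p is a vector of one examinee's correct-response probabilities evaluated at his or her ability estimate, categories are treated as half-open intervals of number-correct scores (one reading of Equation 14), and the function names and example cut-scores are ours.

```r
# Conditional number-correct score distribution via Equation 13.
# p: vector of an examinee's probabilities of answering each item correctly.
score_dist <- function(p) {
  f <- c(1 - p[1], p[1])                    # distribution over scores 0..1
  for (j in seq_along(p)[-1]) {
    # score x can arise from x with a wrong answer or x - 1 with a right one
    f <- c(f, 0) * (1 - p[j]) + c(0, f) * p[j]
  }
  f                                         # probabilities for scores 0..n
}

# Probability of each performance category (Equation 14), with
# number-correct cut-scores kc (already rounded), e.g., c(20, 30, 40)
category_probs <- function(p, kc, m = length(p)) {
  f <- score_dist(p)                        # f[1] is score 0, f[m + 1] score m
  k <- c(0, kc, m + 1)                      # Equation 12 bounds, half-open
  sapply(seq_len(length(k) - 1), function(c) sum(f[(k[c] + 1):k[c + 1]]))
}
```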

Similarities and Differences Between Indices


It is important to highlight some of the key similarities and differences between the indices. First, it is apparent that there are different distributional assumptions with each approach. For the Guo-based indices, a single distribution is not assumed, and the expected probabilities underlying the indices are driven by the likelihood functions. These likelihood functions are typically not symmetrical and can change depending on the response pattern of the examinee. For the IRT-recursive indices, the distribution of number-correct scores is assumed to follow a compound binomial distribution if all the items are dichotomous or a compound multinomial distribution if the test contains some polytomous items.


For the Rudner-based indices, a normal distribution for the examinee ability estimates is assumed when calculating the indices. The different assumptions and formulations may give rise to disparate classification accuracy and classification consistency estimates. However, all the formulations are based on properties of IRT models and typical IRT assumptions. This includes the assumption that the examinee ability estimates and item parameters used are good estimates of the underlying parameters.

All three sets of indices can be classified as person-based methods. However, the Guo-based indices are notably different from the Rudner-based or IRT-recursive-based indices. This can be seen in how the $\hat{\mathbf{P}}$ matrix is determined. In the Guo-based indices, each examinee's ability estimate does not enter into the computations in Equation 10 or the $\hat{\mathbf{P}}$ matrix. Rather, it is the response pattern of the examinee and the equally spaced θ points that drive the computations of the likelihood function values and the $\hat{\mathbf{P}}$ matrix. This implies that the choice of proficiency estimator will not affect the $\hat{\mathbf{P}}$ matrix, as the response pattern is unchanged across proficiency estimators; only the weight matrix $\mathbf{W}$ flagging the observed classifications of the examinees can change with different proficiency estimators. This is in contrast to the IRT-recursive-based indices and the Rudner-based indices, in which both the $\hat{\mathbf{P}}$ and $\mathbf{W}$ matrices can change. In the Rudner-based indices, a likelihood function is not applied, and the computation of the $\hat{\mathbf{P}}$ matrix is based on individual examinee test information functions that change in value with different proficiency estimators. Similarly, for the IRT-recursive indices, the likelihood functions are not used, and each examinee's ability estimate is input into the IRT-recursive formula to determine the probability of receiving each number-correct score given his or her estimated ability. The important implication of the fact that the $\hat{\mathbf{P}}$ matrix does not change for Guo's index is that Guo's classification accuracy index can be potentially different for various proficiency estimators, but the classification consistency index will be identical across proficiency estimators because the $\hat{\mathbf{P}}$ matrix is not changed. This is a potential drawback to the Guo-based classification consistency index, as one would expect that the choice of proficiency estimator would affect classification consistency.

Another difference is that the Rudner- and Guo-based indices perform computations assuming the reporting metric is the θ metric or a linear transformation of this metric, whereas the IRT-recursive-based indices assume that the reporting metric is number-correct scores or a transformation of number-correct scores. This can lead to some small differences in the potential cut-scores, as some form of rounding is often needed to translate the cut-score from the θ metric to the number-correct score metric.

Simulation Study
Given that the classification consistency indices based on Rudner's and Guo's formulations are new, an important question is whether the Rudner- and Guo-based indices perform better than the IRT-recursive procedure, which is more commonly used to compute classification accuracy and consistency with IRT models. In addition, as the assumptions used with the Guo- and Rudner-based indices are closely tied to ML estimation and each of the three indices makes different distributional assumptions, another key question is how using different proficiency estimators affects classification accuracy and consistency indices. It may be that different indices perform better under different conditions. To investigate these questions, a simulation study was performed in which several different factors were manipulated. Data were simulated for two different ability distributions, two different test lengths, three different sets of cut-scores, and four different proficiency estimators.

In the simulation, three cut-scores and four performance categories were assumed in each condition. The number of cut-scores and the number of performance categories were fixed because the effects of the number of cut-scores and performance categories are well known.


In particular, it has been shown that classification accuracy and consistency increase as the number of performance categories decreases (Ercikan & Julian, 2002; Lee et al., 2002). Prior investigations with the indices examined in this study indicate that these patterns hold (Lee, 2010; Lee et al., 2002; Martineau, 2007). The number of examinees was also fixed at 2,000, as preliminary investigations with other sample sizes (e.g., 10,000 and 25,000) produced results that were similar to those for 2,000 examinees. A single fixed test form from which the item parameters were drawn in this simulation was also assumed. This test form consisted of 60 three-parameter logistic (3PL) model items that were drawn from an ACT (American College Testing) mathematics test administered to a sample of more than 100,000 students. The estimated parameters from this sample were assumed to be the true known item parameters in the simulation.

Examinee Distributions
Two different examinee distributions were investigated in this study. The first group of examinees was drawn from a normal distribution with a mean of 0 and a standard deviation of 1. The second group of examinees was drawn from a normal distribution with a mean of 0.5 and a standard deviation of 1.25. These two groups of examinees were chosen arbitrarily. The first group was designed to be similar to a typical group of students, and the assumptions used for ability distributions in many software packages when resolving the IRT indeterminacy problem. The second group was designed to represent a group with slightly more ability and greater dispersion. It is expected that the ability distributions would affect the values of the indices and interact with the placement of the cut-scores. When the distribution of examinees is closer to the cut-scores, the values of the indices are expected to decrease, probably in somewhat similar fashion for all three indices.

Test Length
Two different test lengths were included in the simulation. The first test length included the full set of 60 items from the ACT mathematics test. The second test length was 30 items and consisted of the odd items from the ACT mathematics test. It is expected that as the length of the test is shortened, the classification accuracy and classification consistency indices will decrease as examinee ability estimates tend to have more error with shorter test lengths.

Cut-Scores
Three different sets of cut-scores were considered in this study. The first set of cut-scores was θ = −0.75, 0.00, and 0.75. These cut-scores were designed to represent a situation in which the cut-scores were symmetrically distributed around 0 and centered on the mean of the first ability distribution. The second set of cut-scores was θ = −0.75, −0.35, and 0.75. This allowed the impact of nonsymmetrical cut-scores to be investigated. It is expected that these cut-scores would lower the values of the indices for some examinees between −0.75 and −0.35, as the cut-scores are closer together, and raise the values of the indices for some examinees between −0.35 and 0.75, as these cut-scores are farther apart. As the distance between cut-scores increases, examinees located between the cut-scores should see their classification accuracy and classification consistency estimates rise. However, the values of the indices may be somewhat similar to those when the cuts were set at θ = −0.75, 0.00, and 0.75 due to the trade-off between the individual examinee classification accuracy and consistency estimates at different regions of the scale. The final set of cut-scores investigated was θ = −0.827, −0.034, and 0.694 for the 60-item test and θ = −0.745, −0.042, and 0.706 for the 30-item test.


These cut-scores were included to capture the condition in which the θ cuts align as closely to a set of number-correct scores as is possible. One might expect that this condition would result in the most similar values for classification accuracy and consistency across the indices, as the effect of rounding is essentially removed.

Proficiency Estimators
Four different proficiency estimators were considered in this study. The first proficiency estimator was the IRT true-score (TS) estimator. This estimator was found by estimating the item parameters in BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) and then applying the Newton-Raphson procedure to each number-correct score to determine each examinee's θ estimate. This estimator is important to study, as the IRT-recursive-based procedure assumes that the reporting metric is number-correct scores or a transformation of number-correct scores. One would expect that the IRT-recursive-based indices would perform better with this estimator than with other estimators because the estimator and the philosophy of the index best align in this case. The other three estimators were the estimators available in BILOG-MG. These are the ML estimator, the expected a posteriori (EAP) estimator, and the maximum a posteriori (MAP) estimator. The EAP and MAP estimators are Bayesian estimators, which tend to be pulled toward the mean of the prior ability distribution in comparison with the ML estimator. Each estimator was computed using the default settings of BILOG-MG, except that the IDIST = 3 option was used with the EAP estimator in the SCORE command, the number of quadrature points was increased to 40, the number of expectation-maximization (EM) cycles was increased to 200, and the number of NEWTON cycles was increased to 100. Kolen and Tong (2010) demonstrated that various proficiency estimators can perform differently for classifying students into performance categories in practical contexts, and it is expected that these findings would translate to the computation of classification accuracy and consistency indices. One might expect that the ML estimator would perform the best for the Rudner- and Guo-based indices, as they have theoretical underpinnings related to ML estimation.
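For readers who want the mechanics behind these estimators, a small R sketch of ML and EAP scoring for a single response pattern on a quadrature grid follows. This is our illustration under a N(0, 1) prior and a 1.7-scaled 3PL model, not the BILOG-MG implementation; MAP scoring would analogously take the posterior mode rather than the posterior mean.

```r
# Sketch of ML and EAP scoring for one response pattern under the 3PL model.
# u: 0/1 responses; a, b, g: item parameters; grid: quadrature points.
score_examinee <- function(u, a, b, g, grid = seq(-4, 4, length.out = 40)) {
  icc <- function(theta) g + (1 - g) / (1 + exp(-1.7 * a * (theta - b)))
  loglik <- sapply(grid, function(t) sum(dbinom(u, 1, icc(t), log = TRUE)))
  prior <- dnorm(grid)                        # N(0, 1) prior for EAP
  post  <- exp(loglik - max(loglik)) * prior  # posterior, scaled for stability
  c(ML  = grid[which.max(loglik)],            # grid-search ML estimate
    EAP = sum(grid * post) / sum(post))       # posterior mean (EAP)
}
```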

Simulating and Estimating Classification Accuracy and Consistency


To evaluate the performance of each index, the classification accuracy and consistency were simulated and estimated using R. To simulate classification accuracy, the simulated ability distributions were assumed to be the true distributions, and the estimated thetas (i.e., the $\hat{\theta}$s) were computed from the item responses generated from the assumed true distributions and were taken as the observed distributions. The cut-scores were then applied to each distribution, and the proportion of classifications that remained the same in the observed and true distributions was taken as the simulated classification accuracy. To find the simulated classification consistency, the same true known ability distributions were assumed, and two separate sets of item responses were simulated for each group of examinees. The values of the estimators for the two sets of item responses were determined, and the cut-scores were applied to the observed-score distributions. The proportion of classifications that remained the same for the two observed-score distributions was taken as the simulated classification consistency.

Estimated classification accuracy and consistency were found by computing the Rudner-based indices, Guo-based indices, and IRT-recursive-based indices in R applied to the estimated item and person parameters. For the Guo-based indices, 100 equally spaced θ points were used between each pair of cut-scores.


The estimated classification accuracy and consistency were contrasted with the simulated classification accuracy and consistency estimates to determine which indices best recovered the simulated values. To provide a baseline condition, the values of the indices were also found using the assumed known item parameters and θs. For the Guo index, a set of item responses was simulated to compute the $\hat{\mathbf{P}}$ matrix. The correct classifications in the $\mathbf{W}$ matrix were computed by identifying the performance category in which the likelihood function was maximized based on the simulated item responses. The baseline conditions are labeled No Est. in the tables in the results section, as no estimation of item or person parameters was used. Each cut-score in the number-correct score metric needed to apply the IRT-recursive procedure was rounded to the nearest number-correct score when computing the indices. This led to cut-scores on the θ scale that were different for the Rudner- and Guo-based indices in comparison with the IRT-recursive-based indices.
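The simulation logic just described can be summarized in a sketch like the following, where a simple grid-search ML scorer stands in for whichever proficiency estimator is being studied; all function names and parameter choices are our assumptions, not the authors' code.

```r
# Sketch of simulated classification accuracy and consistency under the 3PL
# model. classify() assigns category indices; ml_score() is a simple
# grid-search ML scorer standing in for any proficiency estimator.
classify <- function(theta, cuts) findInterval(theta, c(-Inf, cuts, Inf))

icc <- function(theta, a, b, g)           # 3PL probability of a correct answer
  g + (1 - g) / (1 + exp(-1.7 * a * (theta - b)))

ml_score <- function(U, a, b, g, grid = seq(-4, 4, by = 0.05)) {
  ll <- sapply(grid, function(t)          # log-likelihood of each pattern at t
    U %*% log(icc(t, a, b, g)) + (1 - U) %*% log(1 - icc(t, a, b, g)))
  grid[max.col(ll)]                       # theta maximizing each row
}

sim_acc_con <- function(theta_true, a, b, g, cuts) {
  gen <- function() {                     # one replication of item responses
    P <- t(sapply(theta_true, icc, a, b, g))
    matrix(rbinom(length(P), 1, P), nrow = length(theta_true))
  }
  true_cat <- classify(theta_true, cuts)
  obs1 <- classify(ml_score(gen(), a, b, g), cuts)
  obs2 <- classify(ml_score(gen(), a, b, g), cuts)  # parallel replication
  c(accuracy    = mean(obs1 == true_cat),           # agree with true categories
    consistency = mean(obs1 == obs2))               # agree across replications
}
```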

Results of Simulation Study


Tables 1 to 4 show the results from the simulation study. Table 1 displays the results when the ability distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 60-item test. Table 2 displays the results when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 60-item test. Table 3 displays the results when the ability distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 30-item test. Table 4 displays the results when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 30-item test. In each table, the results for the Rudner-based indices are shown at the top of the table, the results for the Guo-based indices are shown in the middle of the table, and the results for the IRT-recursive-based indices are shown at the bottom of the table. The results for the different cut-scores are shown under the three column headings in each table.

Several important findings can be observed in the tables. Specifically, the Guo-based indices tended to be the largest, followed by the Rudner-based indices and the IRT-recursive-based indices. For classification accuracy, the estimated values for the Guo-based indices were often the closest to the simulated values. For classification consistency, the Rudner- or Guo-based indices performed best depending on the estimator. In several cases, the differences between the three indices were trivial, with differences in the third decimal place. However, there were some differences that approached 0.04 or 0.05. Differences of 0.05 between the indices might be viewed as somewhat large given that the indices are restricted to a range of 0.00 to 1.00. This suggests that the index chosen to report classification accuracy and consistency can have key impacts on the numbers that are reported. In addition, it is important to notice that in the case of no estimation of item or person parameters, the value for the ML estimator was closest to the values computed for the Rudner- and Guo-based indices. This suggests that the ML-based estimates of the indices were very close to the values of the indices with no estimation of item and person parameters. This is somewhat expected, as the Rudner- and Guo-based indices are closely tied to assumptions for ML estimation. For the IRT-recursive indices with no estimation, there was not a single proficiency estimator that was closest to the no estimation condition across the tables. It can also be seen that the indices tended to be closest in value with the TS estimator in comparison with the other proficiency estimators. In addition, the tables suggest that Rudner's classification accuracy and consistency indices tended to be greatest for the EAP and MAP estimators, Guo's classification accuracy indices tended to be greatest for the ML and EAP estimators, and the IRT-recursive classification accuracy and consistency indices tended to be greatest for the TS and ML estimators. The Guo-based classification consistency index did not change across proficiency estimators, as was expected.

Table 1. Simulated and Estimated Classification Accuracy and Consistency for N(0,1) Ability Distribution for 60 Items

                        Cuts (θ = −0.75, 0.00, 0.75)   Cuts (θ = −0.75, −0.35, 0.75)  Cuts (θ = −0.827, −0.034, 0.694)
Index      Estimator    SA     EA     SC     EC        SA     EA     SC     EC        SA     EA     SC     EC
Rudner     TS           0.832  0.829  0.756  0.763     0.823  0.823  0.755  0.757     0.822  0.831  0.743  0.766
           ML           0.847  0.836  0.784  0.768     0.845  0.833  0.783  0.768     0.848  0.839  0.771  0.770
           MAP          0.856  0.843  0.788  0.776     0.852  0.839  0.789  0.774     0.850  0.844  0.781  0.778
           EAP          0.856  0.848  0.793  0.780     0.852  0.846  0.795  0.781     0.851  0.845  0.781  0.779
           No Est.      –      0.838  –      0.770     –      0.836  –      0.771     –      0.836  –      0.767
Guo        TS           0.832  0.822  0.756  0.804     0.823  0.819  0.755  0.806     0.822  0.819  0.743  0.798
           ML           0.847  0.852  0.784  0.804     0.845  0.852  0.783  0.806     0.848  0.850  0.771  0.798
           MAP          0.856  0.834  0.788  0.804     0.852  0.844  0.789  0.806     0.850  0.849  0.781  0.798
           EAP          0.856  0.839  0.793  0.804     0.852  0.847  0.795  0.806     0.851  0.853  0.781  0.798
           No Est.      –      0.859  –      0.800     –      0.858  –      0.802     –      0.857  –      0.798
Recursive  TS           0.832  0.823  0.756  0.763     0.823  0.817  0.755  0.761     0.822  0.826  0.743  0.764
           ML           0.847  0.819  0.784  0.758     0.845  0.815  0.783  0.764     0.848  0.818  0.771  0.761
           MAP          0.856  0.811  0.788  0.742     0.852  0.805  0.789  0.751     0.850  0.806  0.781  0.753
           EAP          0.856  0.819  0.793  0.759     0.852  0.814  0.795  0.760     0.851  0.808  0.781  0.754
           No Est.      –      0.806  –      0.748     –      0.806  –      0.752     –      0.817  –      0.748

Note: SA = simulated accuracy; EA = estimated accuracy; SC = simulated consistency; EC = estimated consistency; IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Dashes indicate cells that do not apply (no simulated value for the No Est. condition). Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive formulation.

Table 2. Simulated and Estimated Classification Accuracy and Consistency for N(0.5,1.25) Ability Distribution for 60 Items

                        Cuts (θ = −0.75, 0.00, 0.75)   Cuts (θ = −0.75, −0.35, 0.75)  Cuts (θ = −0.827, −0.034, 0.694)
Index      Estimator    SA     EA     SC     EC        SA     EA     SC     EC        SA     EA     SC     EC
Rudner     TS           0.873  0.861  0.815  0.801     0.879  0.862  0.824  0.806     0.869  0.861  0.812  0.803
           ML           0.881  0.871  0.833  0.815     0.891  0.872  0.841  0.820     0.884  0.872  0.825  0.817
           MAP          0.889  0.880  0.843  0.826     0.892  0.876  0.839  0.825     0.876  0.878  0.835  0.824
           EAP          0.882  0.879  0.841  0.825     0.889  0.879  0.852  0.828     0.882  0.877  0.834  0.822
           No Est.      –      0.872  –      0.816     –      0.874  –      0.822     –      0.872  –      0.817
Guo        TS           0.873  0.871  0.815  0.844     0.879  0.869  0.824  0.848     0.869  0.868  0.812  0.844
           ML           0.881  0.887  0.833  0.844     0.891  0.888  0.841  0.848     0.884  0.887  0.825  0.844
           MAP          0.889  0.869  0.843  0.844     0.892  0.882  0.839  0.848     0.876  0.877  0.835  0.844
           EAP          0.882  0.877  0.841  0.844     0.889  0.888  0.852  0.848     0.882  0.882  0.834  0.844
           No Est.      –      0.889  –      0.843     –      0.889  –      0.846     –      0.889  –      0.843
Recursive  TS           0.873  0.863  0.815  0.815     0.879  0.864  0.824  0.820     0.869  0.860  0.812  0.813
           ML           0.881  0.857  0.833  0.814     0.891  0.860  0.841  0.821     0.884  0.861  0.825  0.813
           MAP          0.889  0.850  0.843  0.804     0.892  0.853  0.839  0.811     0.876  0.860  0.835  0.810
           EAP          0.882  0.857  0.841  0.812     0.889  0.859  0.852  0.818     0.882  0.861  0.834  0.809
           No Est.      –      0.851  –      0.806     –      0.854  –      0.813     –      0.861  –      0.807

Note: SA = simulated accuracy; EA = estimated accuracy; SC = simulated consistency; EC = estimated consistency; IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Dashes indicate cells that do not apply (no simulated value for the No Est. condition). Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive formulation.

Table 3. Simulated and Estimated Classification Accuracy and Consistency for N(0,1) Ability Distribution for 30 Items

                        Cuts (θ = −0.75, 0.00, 0.75)   Cuts (θ = −0.75, −0.35, 0.75)  Cuts (θ = −0.745, −0.042, 0.706)
Index      Estimator    SA     EA     SC     EC        SA     EA     SC     EC        SA     EA     SC     EC
Rudner     TS           0.779  0.766  0.680  0.686     0.779  0.772  0.703  0.698     0.774  0.766  0.679  0.683
           ML           0.807  0.792  0.715  0.707     0.812  0.795  0.735  0.717     0.802  0.790  0.720  0.704
           MAP          0.801  0.798  0.720  0.713     0.817  0.797  0.744  0.716     0.805  0.798  0.718  0.712
           EAP          0.807  0.800  0.734  0.714     0.821  0.803  0.759  0.716     0.812  0.800  0.733  0.712
           No Est.      –      0.789  –      0.701     –      0.792  –      0.712     –      0.787  –      0.699
Guo        TS           0.779  0.783  0.680  0.756     0.779  0.789  0.703  0.767     0.774  0.785  0.679  0.752
           ML           0.807  0.816  0.715  0.756     0.812  0.819  0.735  0.767     0.802  0.814  0.720  0.752
           MAP          0.801  0.805  0.720  0.756     0.817  0.819  0.744  0.767     0.805  0.811  0.718  0.752
           EAP          0.807  0.811  0.734  0.756     0.821  0.824  0.759  0.767     0.812  0.817  0.733  0.752
           No Est.      –      0.822  –      0.751     –      0.826  –      0.761     –      0.821  –      0.747
Recursive  TS           0.779  0.768  0.680  0.703     0.779  0.778  0.703  0.725     0.774  0.767  0.679  0.703
           ML           0.807  0.762  0.715  0.699     0.812  0.765  0.735  0.719     0.802  0.763  0.720  0.699
           MAP          0.801  0.762  0.720  0.763     0.817  0.759  0.744  0.705     0.805  0.761  0.718  0.695
           EAP          0.807  0.763  0.734  0.758     0.821  0.767  0.759  0.712     0.812  0.764  0.733  0.697
           No Est.      –      0.744  –      0.687     –      0.753  –      0.694     –      0.766  –      0.687

Note: SA = simulated accuracy; EA = estimated accuracy; SC = simulated consistency; EC = estimated consistency; IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Dashes indicate cells that do not apply (no simulated value for the No Est. condition). Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive formulation.

Table 4. Simulated and Estimated Classification Accuracy and Consistency for N(0.5,1.25) Ability Distribution for 30 Items

                        Cuts (θ = −0.75, 0.00, 0.75)   Cuts (θ = −0.75, −0.35, 0.75)  Cuts (θ = −0.745, −0.042, 0.706)
Index      Estimator    SA     EA     SC     EC        SA     EA     SC     EC        SA     EA     SC     EC
Rudner     TS           0.811  0.770  0.762  0.699     0.831  0.775  0.778  0.712     0.820  0.771  0.762  0.700
           ML           0.834  0.828  0.783  0.758     0.848  0.837  0.802  0.774     0.843  0.829  0.796  0.759
           MAP          0.835  0.834  0.782  0.763     0.847  0.837  0.799  0.771     0.838  0.834  0.781  0.763
           EAP          0.838  0.831  0.780  0.758     0.850  0.838  0.795  0.771     0.840  0.830  0.778  0.758
           No Est.      –      0.825  –      0.749     –      0.830  –      0.761     –      0.825  –      0.750
Guo        TS           0.811  0.825  0.762  0.795     0.831  0.834  0.778  0.808     0.820  0.827  0.762  0.796
           ML           0.834  0.848  0.783  0.795     0.848  0.855  0.802  0.808     0.843  0.850  0.796  0.796
           MAP          0.835  0.833  0.782  0.795     0.847  0.853  0.799  0.808     0.838  0.835  0.781  0.796
           EAP          0.838  0.838  0.780  0.795     0.850  0.858  0.795  0.808     0.840  0.837  0.778  0.796
           No Est.      –      0.853  –      0.794     –      0.859  –      0.805     –      0.852  –      0.794
Recursive  TS           0.811  0.817  0.762  0.769     0.831  0.823  0.778  0.784     0.820  0.817  0.762  0.769
           ML           0.834  0.805  0.783  0.767     0.848  0.820  0.802  0.786     0.843  0.816  0.796  0.767
           MAP          0.835  0.800  0.782  0.759     0.847  0.810  0.799  0.770     0.838  0.810  0.781  0.759
           EAP          0.838  0.795  0.780  0.758     0.850  0.810  0.795  0.772     0.840  0.807  0.778  0.756
           No Est.      –      0.805  –      0.758     –      0.814  –      0.767     –      0.822  –      0.759

Note: SA = simulated accuracy; EA = estimated accuracy; SC = simulated consistency; EC = estimated consistency; IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Dashes indicate cells that do not apply (no simulated value for the No Est. condition). Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive formulation.


Clearly, it is possible for interactions to exist between the proficiency estimator that is chosen and the index selected to report classification accuracy or consistency.

Tables 1 through 4 also indicate that in many situations, the estimated classification accuracy and consistency were lower than the simulated values for the Rudner-based and IRT-recursive-based indices, with a couple of exceptions for a few of the computations with the TS estimator. The Guo-based classification accuracy indices tended to be lower than the simulated values for the TS, EAP, and MAP estimators and larger for the ML estimator for the 60-item test. For the 30-item test, the estimated classification accuracy exceeded the simulated classification accuracy in most situations. The Guo-based classification consistency indices were higher than the simulated classification consistency in almost all cases. The simulated values tended to be best recovered with the TS estimator, although there were a few exceptions when the cut-scores were at θ = −0.827, −0.034, and 0.694, where the values for some other proficiency estimators were better recovered, and with the Guo-based indices with the cuts at θ = −0.75, −0.35, and 0.75, where the ML estimator performed the best.

In terms of test length, the results follow what one would expect, with the classification accuracy and classification consistency dropping quite a bit across the board between Tables 1 and 3 and between Tables 2 and 4 as the test length decreased from 60 items to 30 items. When looking at these tables, one also notices that the differences between the TS estimator and the EAP, MAP, and ML estimators tended to become larger as the test length was decreased for the Rudner- and Guo-based indices. In addition, the differences between the IRT-recursive indices and the other indices also tended to increase when test length was decreased. This suggests that there are important potential interactions between the length of the test, different proficiency estimators, and the index that one chooses to employ.

In terms of the different cut-scores, Table 1 suggests that when the distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 60-item test, the values of the Rudner-based indices tended to be the least when the cut-scores were at θ = −0.75, −0.35, and 0.75, and they tended to be the greatest when the cut-scores were at θ = −0.827, −0.034, and 0.694. For the Guo-based classification accuracy indices, the TS and ML estimators were greatest for θ = −0.75, −0.35, and 0.75, and least for θ = −0.827, −0.034, and 0.694. For the EAP and MAP estimators, the θ = −0.827, −0.034, and 0.694 cuts produced the highest classification accuracy estimates, and θ = −0.75, 0.00, and 0.75 produced the least. Guo's classification consistency indices were greatest for all estimators when the cuts were at θ = −0.75, −0.35, and 0.75. For the IRT-recursive-based indices, the pattern was slightly different: except for the TS estimator, the θ = −0.75, 0.00, and 0.75 cuts produced the highest values of the indices. For the TS estimator, the θ = −0.827, −0.034, and 0.694 cuts produced the highest classification accuracy and consistency. The patterns were not as clear and consistent when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 60-item test (see Table 2). In this case, many of the estimated values of classification accuracy and consistency were very similar and trivially different across cut-scores.
For the 30-item tests (see Tables 3 and 4), the cuts at θ = −0.75, −0.35, and 0.75 tended to produce the highest classification accuracy and consistency for all three sets of indices for both distributions of examinees. It is also important to notice that when the ability distribution had a mean of 0.5 and a standard deviation of 1.25, as opposed to a mean of 0 and a standard deviation of 1, the values of the indices rose across the board. This is consistent with the understanding that as the ability of the examinees moves away from the cut-scores, classification accuracy and classification consistency go up.

Figures 1 and 2 provide pictures of the classification accuracy and classification consistency for the three different indices at various θ locations. Figure 1 is for the 60-item test, and Figure 2 is for the 30-item test.


Figure 1. Plot of classification accuracy and consistency curves for simulations with 60 items
Note: The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel shows the classification accuracy curves with cuts at θ = −0.75, 0.00, and 0.75; the top middle panel shows the classification accuracy curves with cuts at θ = −0.75, −0.35, and 0.75; the top right panel shows the classification accuracy curves with cuts at θ = −0.827, −0.034, and 0.694; the bottom left panel shows the classification consistency curves with cuts at θ = −0.75, 0.00, and 0.75; the bottom middle panel shows the classification consistency curves with cuts at θ = −0.75, −0.35, and 0.75; and the bottom right panel shows the classification consistency curves with cuts at θ = −0.827, −0.034, and 0.694.

The figures do not assume a particular proficiency estimator and were created based on the assumed known item parameters for the 30- and 60-item tests. The x-axis is the examinee's θ, and the y-axis is the value of the classification accuracy or the classification consistency for that θ. The solid lines show the Rudner-based index, the jagged dotted lines show the Guo-based index, and the dashed lines show the IRT-recursive-based index. The lines for the Guo-based indices are not smooth due to the simulation of item responses for examinees at each θ needed to calculate the indices. For the Rudner-based and IRT-recursive-based indices, each θ can be applied in conjunction with the known item parameters without simulating item responses, creating a smooth line. The pictures clearly show the different functional forms of each of the indices and what the value of each index would be for an examinee at each θ value.

The figures help to explain some of the findings in Tables 1 through 4. In particular, it appears that the Guo-based indices have a slightly different pattern in terms of the value of the indices for examinees at different θs than the Rudner- and IRT-recursive-based indices, as they do not go up as high in between cut-scores or as low at the cut-scores as the other two indices. This is probably due in part to the use of the item response patterns and the focus on likelihood functions instead of examinee θs when computing the indices. One can also see that the Rudner-based indices exceeded the IRT-recursive-based indices in between the cut-scores. These two indices have dips at the cut-scores in Figures 1 and 2 that do not align exactly for the first two panels in each figure due to the rounding of the cut-scores. At the extremes of the θ distribution, the Guo-based and IRT-recursive-based indices had higher classification accuracy and consistency. This makes sense because, for the Rudner-based indices, having an extreme score often was associated with having an extremely low value of the test information function, which would lower the classification accuracy and consistency.


Figure 2. Plot of classification accuracy and consistency curves for simulations with 30 items
Note: The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel shows the classification accuracy curves with cuts at θ = −0.75, 0.00, and 0.75; the top middle panel shows the classification accuracy curves with cuts at θ = −0.75, −0.35, and 0.75; the top right panel shows the classification accuracy curves with cuts at θ = −0.745, −0.042, and 0.706; the bottom left panel shows the classification consistency curves with cuts at θ = −0.75, 0.00, and 0.75; the bottom middle panel shows the classification consistency curves with cuts at θ = −0.75, −0.35, and 0.75; and the bottom right panel shows the classification consistency curves with cuts at θ = −0.745, −0.042, and 0.706.

The IRT-recursive-based and Guo-based indices, however, do not consider test information, and extreme scores were associated with higher classification accuracy and consistency. This is an important difference that is worth noting and suggests that, for distributions of examinees with more extreme scores, the Rudner-based indices would probably be lower than the other two indices. This is a potential downside to the Rudner-based indices, as one would anticipate that the probability of accurately and consistently classifying an examinee with an extreme score would be high. In many practical situations, most of the examinees will often be in regions where the cut-scores are located, and one would probably expect that the Rudner-based indices would work well, as they did in the simulations.

Michigan Merit Examination (MME) Data


Data for the practical examples were drawn from the MME. The MME is a large-scale assessment given to 11th graders and some eligible 12th graders that is used for school accountability and adequate yearly progress determinations in Michigan. The MME has five subject tests (reading, math, science, writing, and social studies) consisting of items from the ACT, WorkKeys, and custom Michigan-developed components. Subsets of items are selected from the ACT and WorkKeys along with the Michigan-developed components to align with Michigan's high school academic content standards. These items are used to determine an examinee's score in each subject. Data from the MME reading and math tests are considered in the examples in this article.

The MME reading test consists of 51 operational multiple-choice items: 32 of the items come from the ACT reading test and 19 of the items come from the WorkKeys reading for information test.



Table 5. Estimated Classification Accuracy and Consistency for MME Reading and Math Tests

                        Reading (n = 98,423)       Math (n = 97,888)
Index      Estimator    Accuracy   Consistency     Accuracy   Consistency
Rudner     TS           0.829      0.763           0.800      0.727
           ML           0.821      0.761           0.806      0.734
           MAP          0.810      0.750           0.792      0.717
           EAP          0.807      0.746           0.800      0.726
Guo        TS           0.799      0.759           0.792      0.758
           ML           0.821      0.759           0.817      0.758
           MAP          0.801      0.759           0.814      0.758
           EAP          0.811      0.759           0.817      0.758
Recursive  TS           0.800      0.744           0.782      0.720
           ML           0.801      0.740           0.783      0.721
           MAP          0.792      0.730           0.763      0.696
           EAP          0.788      0.725           0.772      0.705

Note: MME = Michigan Merit Examination; IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive formulation.

An examinee's reported score is a linear transformation of his or her θ estimate from applying the 3PL model to these data. The 3PL model exhibited moderate degrees of misfit. The MME technical report puts misfit at roughly 43% of the items not fitting the model when using the $S-X^2$ fit statistic of Orlando and Thissen (2000). There were 98,423 examinees who received valid scores on the initial form of the MME reading test that were considered in this article. The estimated reliability for these data was .89.

The MME math test is made up of 67 operational multiple-choice items: 3 of the items come from the WorkKeys locating information test, 12 of the items come from the WorkKeys applied mathematics test, 36 of the items come from the ACT mathematics test, and the remaining items are custom-developed items. Scores reported to examinees again are a linear transformation of each examinee's θ estimate from applying the 3PL model to these data. Model fit using the $S-X^2$ fit statistic reported in the MME technical report was 28% of the items not fitting the model. There were 97,888 examinees who received valid scores on the MME math test considered in this article. The estimated reliability for these data was .87.

Results for MME Data


Table 5 displays the results for the classification accuracy and consistency for the MME reading and math tests for the three cut-scores that are used to make classification decisions on each assessment. The results for the Rudner-based indices are shown at the top of the table, the results for the Guo-based indices are shown in the middle of the table, and the results for the IRT-recursive-based indices are shown at the bottom of the table. For all three indices, the classification accuracy and consistency were higher for the reading test compared with the math test, except for the MAP and EAP estimators for the Guo-based classification accuracy index. For the MME math test, the results were similar to the simulation, where the Guo-based indices tended to exceed the Rudner-based indices, which exceeded the IRT-recursive-based indices. The TS estimator had a larger classification accuracy value for the Rudner-based indices than for the Guo-based indices for these data.


Figure 3. Plot of classification accuracy and consistency curves for MME reading and math
Note: MME = Michigan Merit Examination; IRT = item response theory. The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel is the classification accuracy curve with four performance levels for reading, the top right panel is the classification consistency curve with four performance levels for reading, the bottom right panel is the classification accuracy curve with four performance levels for math, and the bottom left panel is the classification consistency curve with four performance levels for math.

For the MME reading test, the results were different: in some cases, the Rudner-based indices were larger than the Guo-based indices for classification accuracy and consistency. The IRT-recursive-based indices were again the smallest. The largest differences between the three indices across the proficiency estimators were around 0.03 for the MME reading test and 0.05 for the MME math test. These levels of differences were somewhat similar to some of the differences observed in the simulation. Somewhat different from the simulation was the rank ordering of the values of the indices across the proficiency estimators. In the simulation, the EAP and MAP estimators tended to have the highest values for the Rudner-based indices. However, in the practical examples, the EAP and MAP estimators had values that were lower than those for the TS and ML estimators for the Rudner-based indices. The EAP and MAP estimators also had lower classification accuracy and consistency estimates for the IRT-recursive-based indices. The ML estimator again had the highest value for the Guo-based classification accuracy indices, and the classification consistency was the same for all estimators. Figure 3 graphically displays the classification accuracy and classification consistency for the indices at various θ locations for the MME reading and math tests, similar to Figures 1 and 2. The top panels are for the MME reading test and the bottom panels are for the MME math test. The solid lines in the panels are for the Rudner-based indices, the jagged dotted lines are for the Guo-based indices, and the dashed lines are for the IRT-recursive-based indices. The dips in the figures for the Rudner-based indices show the placements of the cut-scores. These dips do not lie directly on top of each other for the Rudner- and IRT-recursive-based indices because of the rounding needed to calculate the IRT-recursive-based indices in the number-correct score metric. The spread and placement of the cut-scores also differed across the two tests: for the reading test, the cut-scores were more spread apart, and for the math test, the cut-scores were closer together. The figures show the impact of these cut-score placements on the values of the indices.


When the cut-scores were closer together, the classification accuracy and consistency for individual examinees tended to decrease in comparison with when they were further apart. The panels also show that, between the first and second cut-scores, the Rudner-based indices tended to exceed the IRT-recursive-based indices. The lines for the Guo-based indices again had a different pattern than the Rudner-based or IRT-recursive-based indices, with dips not going as low and peaks not going as high. More notable differences for the Guo-based indices were found for the MME reading test compared with the MME math test. For the MME reading test, the peak between the first and second cut-scores is associated with a smooth dip and lower classification accuracy and consistency for the Guo-based index in comparison with the Rudner-based index. Because many examinees had scores between these cut-scores, this may explain why the Rudner-based indices were in some cases higher for these data in Table 5.

Discussion and Conclusion


The purposes of this article were (a) to introduce classification consistency indices based on Rudner's and Guo's formulations and (b) to evaluate the performance of the Rudner-based, Guo-based, and IRT-recursive-based indices in simulation and practice. The development of these new indices is important because many of the current approaches for calculating classification accuracy and consistency assume that the reporting metric is number-correct scores or a transformation of number-correct scores, which may not be the approach used to determine scores when IRT models are applied. This can lead to small, subtle differences in the cut-scores in the IRT θ metric due to the rounding needed to compute the indices. The Rudner- and Guo-based indices do not make the assumption that the reporting metric is number-correct scores and can be applied when tests are scored in the θ metric or a linear transformation of this metric. The Rudner- and Guo-based indices are also computationally simpler and are closely tied to assumptions used with ML estimation, which is an often-used approach for estimating examinee abilities with IRT models. Despite the conceptual and practical appeal of the Rudner- and Guo-based indices, the performance of these indices with various proficiency estimators and across a variety of conditions has not been fully investigated. To date, only Martineau (2007) and Wyse (2011) have looked at the performance of Rudner's classification accuracy index considering some of the factors that can affect the index. However, these articles did not consider the classification consistency index, did not look at the interaction of the indices with various proficiency estimators, and did not look at how the Rudner-based indices perform in comparison with other commonly used classification accuracy and consistency indices. The Guo-based classification index has only been compared with the Rudner-based index using a practical example when the index was originally formulated and has not been compared with the Rudner-based or IRT-recursive indices in a systematic way. This study provided an initial investigation of a few of these factors in a simulated and practical setting. Results from these investigations suggested that the Guo-based indices tended to have the highest classification accuracy and consistency, followed by the Rudner-based indices and the IRT-recursive-based indices. Guo's classification accuracy index and the Guo- and Rudner-based classification consistency indices performed the best for recovering classification accuracy and consistency. The differences among the three indices were often small and in some cases trivial, but there were some differences on the magnitude of 0.04 or 0.05 units for the whole population. This finding is important, especially given that Lee (2010) has observed that IRT-recursive-based indices also tend to be higher than values estimated with the non-IRT-based Livingston and Lewis (1995) procedure. This may suggest that there may be even larger differences between the Guo- and Rudner-based indices and those from the Livingston and Lewis procedure. Future research could compare these indices in a variety of situations.
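To make the θ-metric computations concrete, the sketch below implements a Rudner-style per-examinee accuracy index and a squared-probability consistency index of the kind described in this article, under a normal measurement error model. This is a minimal illustration, not necessarily the authors' exact implementation: the function name rudner_indices, the variable names, and the example cut-scores and standard errors are all hypothetical stand-ins for the values a real test would supply.

import numpy as np
from scipy.stats import norm

def rudner_indices(theta_hat, se, cuts):
    # Category boundaries on the theta scale, padded with -inf and +inf
    bounds = np.concatenate(([-np.inf], cuts, [np.inf]))
    # p[i, j] = P(the estimate for examinee i falls in category j) under a
    # normal error model centered at theta_hat[i] with sd se[i]
    z_upper = (bounds[1:] - theta_hat[:, None]) / se[:, None]
    z_lower = (bounds[:-1] - theta_hat[:, None]) / se[:, None]
    p = norm.cdf(z_upper) - norm.cdf(z_lower)
    # Category that each theta_hat itself falls in
    observed = np.searchsorted(cuts, theta_hat)
    # Accuracy: probability of landing in the category containing theta_hat
    accuracy = p[np.arange(len(theta_hat)), observed]
    # Consistency: probability of the same category on two parallel forms
    consistency = (p ** 2).sum(axis=1)
    return accuracy, consistency

# Hypothetical example with three cut-scores, as on the MME tests
theta_hat = np.array([-1.2, 0.1, 0.8, 2.0])
se = np.array([0.40, 0.25, 0.28, 0.45])  # e.g., 1 / sqrt(I(theta_hat))
acc, con = rudner_indices(theta_hat, se, np.array([-0.5, 0.5, 1.5]))
print(acc.mean(), con.mean())  # marginal indices: averages over examinees

In a real application, the conditional standard errors would come from the test information function implied by the chosen proficiency estimator, which is one way the choice of estimator can change the resulting index values.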


This research would be valuable because the simulation in this study, although designed to look at a variety of factors that can affect the indices, may not reflect the full range of factors that affect the indices in all situations. It is possible that the indices may perform differently with alternate tests, different placements of cut-scores, and various other factors, such as skewed score distributions. The results from the simulations and investigations also suggest some potential features of the Rudner- and Guo-based indices that should be highlighted. Specifically, the Guo-based classification consistency tended to be notably higher than the other indices and did not change with different proficiency estimators. This suggests that one should use caution when applying Guo's classification consistency index, and the Guo-based formulation may be better when investigations are focused only on classification accuracy or on a single proficiency estimator, such as the ML estimator. In terms of the Rudner-based indices, results suggested that the indices may be adversely affected when the examinee distribution contains more examinees with extreme scores, as extreme scores often have less test information. This may suggest that in these situations, one may want to consider the application of another index. A notable finding of this study was that the values of the classification accuracy and classification consistency indices can change for various proficiency estimators. These findings are similar to those in Kolen and Tong (2010), who observed that the choice of different proficiency estimators can affect the number of students reported in different performance levels. It is well known that alternate proficiency estimators have different properties and that choosing different estimators can change examinee ability estimates and classifications. This article highlights that the choice of proficiency estimator can also affect the values of the classification accuracy and consistency indices. Additional research could evaluate classification accuracy and consistency with different proficiency estimators in other contexts. This study also highlights the benefit of creating classification accuracy and classification consistency plots, like those in Figures 1 to 3, when investigating different classification indices. These plots allow the researcher and the practitioner to look across the range of possible scores and identify regions in which the indices are performing differently. These pictures can also help identify possible explanations for why the indices tended to produce disparate values in simulated and practical settings. In this article, the figures helped to show some of the differences in how the Rudner-based indices treated extreme scores; the Rudner-based indices tended to have lower classification accuracy and consistency for extreme scores because these scores tended to be associated with lower values of the IRT test information function. The figures also depicted some differences in the cut-scores due to the rounding of scores that was needed with the IRT-recursive-based procedure, as well as higher values for the Rudner-based indices between the cut-scores. One can also see some of the fundamental differences between the Guo-based and the other indices due to the use of likelihood functions and response patterns with the Guo indices. The Guo indices produced graphs that were less smooth and in which the peaks and valleys between and at cut-scores were less pronounced.
In the simulation, this produced results in which the Guo-based indices often exceeded the Rudner- and IRT-recursive-based indices. However, as the MME reading practical example suggested, it will not always hold that the Guo-based indices will be larger than the other indices. The figures suggest that it is possible for the indices to be higher or lower depending on the distribution of examinee scores. It is also important to point out the rather obvious observation that there are a number of factors that can affect classification accuracy and consistency. These factors include the classification accuracy or consistency index chosen, the distribution of examinee performance, the number and placement of the cut-scores, the proficiency estimator and scoring metric chosen, the properties and number of items in the assessment, and the models applied to the test data, as well as the fit of those models.
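As a hedged illustration of how plots of this kind might be produced, the following sketch evaluates the hypothetical rudner_indices function from the earlier sketch on a grid of θ values and draws the resulting accuracy and consistency curves. The constant test information value and the cut-scores are illustrative assumptions, not values from the MME data.

import numpy as np
import matplotlib.pyplot as plt
# rudner_indices is the function defined in the earlier sketch

grid = np.linspace(-3.0, 3.0, 301)                 # theta grid
se_grid = np.full_like(grid, 1.0 / np.sqrt(16.0))  # constant I(theta) = 16, for illustration
acc_curve, con_curve = rudner_indices(grid, se_grid, np.array([-0.5, 0.5, 1.5]))

plt.plot(grid, acc_curve, "k-", label="Classification accuracy")
plt.plot(grid, con_curve, "k--", label="Classification consistency")
plt.xlabel("theta")
plt.ylabel("Index value")
plt.legend()
plt.show()  # dips occur at the cut-scores, mirroring the pattern in Figures 1 to 3

A real application would replace the constant information with the test information function of the operational form, which is what produces the lower accuracy and consistency at extreme scores noted above.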


It is also important to note the relationships that exist between classification accuracy and classification consistency. As Equations 7 and 8 suggest, classification accuracy should be greater than or equal to classification consistency, given that classification accuracy involves computations assuming a single administration of the assessment, whereas classification consistency involves administrations of parallel forms or the squaring of computations from a single administration to approximate the similarity of classifications across forms.
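One way to see why this ordering holds, under the squared-probability formulation sketched earlier (writing \(p_j\) for the probability that an examinee's estimate falls in category \(j\), and \(\gamma\) for the consistency), is the following short derivation:

\[
\gamma \;=\; \sum_j p_j^2 \;\le\; \Big(\max_k p_k\Big) \sum_j p_j \;=\; \max_k p_k ,
\]

so whenever the category used in the accuracy computation is the modal category, the accuracy is at least \(\max_k p_k\) and therefore at least as large as the consistency.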
Both types of indices can be valuable, but the indices address different questions. This suggests that, depending on the situation and the question of interest related to the classification decisions, one index or the other may be more appropriate. This implies that reporting both indices may not always be necessary. It also means that a critical consideration is having classification accuracy and consistency indices that come from the same foundation and that can be applied to the same data, because the question asked may better fit one type of index or the other.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

References
Brennan, R. L., & Wan, L. (2004). Bootstrap procedures for estimating decision consistency for single-administration complex assessments (CASMA Research Report No. 7). Iowa City: Center for Advanced Studies in Measurement and Assessment, University of Iowa.
Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269-294.
Guo, F. (2006). Expected classification accuracy using the latent distribution. Practical Assessment, Research & Evaluation, 11(6). Retrieved from http://pareonline.net/pdf/v11n6.pdf
Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated under alternative strong true score models. Journal of Educational Measurement, 27, 345-359.
Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253-264.
Kolen, M. J., & Tong, Y. (2010). Psychometric properties of IRT proficiency estimates. Educational Measurement: Issues and Practice, 29, 8-14.
Lee, W. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47, 1-17.
Lee, W., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments under the compound multinomial model. Applied Psychological Measurement, 33, 374-390.
Lee, W., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26, 412-432.
Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179-197.
Martineau, J. A. (2007). An expansion and practical evaluation of expected classification accuracy. Applied Psychological Measurement, 31, 181-194.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50-64.
Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment, Research & Evaluation, 7(14). Retrieved from http://PAREonline.net/getvn.asp?v=7&n=14
Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment, Research & Evaluation, 10(13). Retrieved from http://pareonline.net/pdf/v10n13.pdf


Schulz, E. M., Kolen, M. J., & Nicewander, W. A. (1999). A rationale for defining achievement levels using IRT-estimated domain scores. Applied Psychological Measurement, 23, 347-362.
Subkoviak, M. J. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 13, 265-276.
Thissen, D., Pommerich, M., Billeaud, K., & Williams, V. S. L. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19, 39-49.
Wang, T., Kolen, M. J., & Harris, D. J. (2000). Psychometric properties of scale scores and performance levels for performance assessments using polytomous IRT. Journal of Educational Measurement, 37, 141-162.
Wyse, A. E. (2011). The potential impact of not being able to create parallel tests on expected classification accuracy. Applied Psychological Measurement, 35, 110-126.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (1996). BILOG-MG: Multiple group IRT analysis and test maintenance for binary items [Computer program]. Chicago, IL: Scientific Software International.
