
Linacre (1994) states that a sample size of 27-61 is appropriate, depending on the
targeting. That is, if the test is well targeted on the students, a sample of 27 is
enough; if the test is off-target, you need about 61.
For polytomies like your work, he says the sample could be even smaller because
there is more information in polytomies. But you need at least 10 observations
per category.
http://www.rasch.org/rmt/rmt74m.htm

Sample Size and Item Calibration Stability

Polytomies
The extra concern with polytomies is that you need at least 10 observations per
category; see, for instance, Linacre J.M. (2002) Understanding Rasch
measurement: Optimizing rating scale category effectiveness. Journal of Applied
Measurement 3:1, 85-106, or Linacre J.M. (1999) Investigating rating scale
category utility. Journal of Outcome Measurement 3:2, 103-122. Otherwise the
actual sample sizes could be smaller than with dichotomies because there is
more information in each polytomous observation.

Person Calibration Stability
The requirements are symmetric for the Rasch model, so you need as many items
for a stable person measure as you need persons for a stable item measure.

How big a sample is necessary to obtain usefully stable item calibrations?
Each time we calibrate a set of items on different samples of similar
examinees, we expect slightly different results. In principle, as the size of
the samples increases, the differences become smaller. If each sample were only
2 or 3 examinees, results could be very unstable. If each sample were 2,000 or
3,000 examinees, results might be essentially identical, provided no other
sources of error are in action. But large samples are expensive and
time-consuming. What is the minimum sample to give useful item calibrations,
that is, calibrations that we can expect to be similar enough to maintain a
useful level of measurement stability?
The first step is to clarify "similar enough." Just as no person has a height
stable to within .01 or even .1 inches, no item has a difficulty stable to
within .01 or even .1 logits. In fact, stability to within .3 logits is the
best that can be expected for most variables. Lee (RMT 6:2 p.222-3) reports
that in many applications one logit change corresponds to one grade level
advance. So when an item calibration is stable within a logit, it will be
targeted at a correct grade level.
For groups of items, Wright & Douglas (Best
Test Design and Self-Tailored Testing, MESA Memo. 19, 1975) report that, when
calibrations deviate in a random way from their optimal values, "as test length
increases above 30 items, virtually no reasonable testing situation risks a
measurement bias [for the examinees] large enough to notice." For even shorter
tests, measures based on item calibration with random deviations up to 0.5 logits
are "for all practical purposes free from bias."
Theoretically, the stability of an item calibration is its modelled standard error. For
a sample of N examinees, that is reasonably targeted at the items and that
responds to the test as intended, average item p-values are in the range 0.5 to
0.87, so that modelled item standard errors are in the range 2/sqrt(N) < SE <
3/sqrt(N) (Wright & Stone, Best Test Design, p.136), i.e., 4/SE^2 < N < 9/SE^2. The
lower end of the range applies when the sample is targeted on items with 40%-
60% success rate, the higher end when the sample obtains success rates more
extreme than 15% or 85% success. As a rule of thumb, at least 8 correct
responses and 8 incorrect responses are needed for reasonable confidence that
an item calibration is within 1 logit of a stable value.
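As a rough illustration of the bounds above, the following Python sketch evaluates 2/sqrt(N) and 3/sqrt(N) for a few sample sizes. The function name item_se_bounds is hypothetical, chosen only for this example; the multipliers 2 and 3 are the ones quoted from Wright & Stone.

```python
import math


def item_se_bounds(n):
    """Return (best-case, worst-case) modelled item S.E. in logits.

    Applies the bounds quoted in the text, 2/sqrt(N) < SE < 3/sqrt(N).
    The function name is hypothetical and used for illustration only.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of examinees")
    root = math.sqrt(n)
    return 2 / root, 3 / root


# Print the S.E. range for a few illustrative sample sizes.
for n in (30, 50, 100):
    low, high = item_se_bounds(n)
    print(f"N={n}: {low:.2f} < SE < {high:.2f} logits")
```

For N = 100 this gives a range of 0.20 to 0.30 logits, which is about the .3 logit stability described earlier as the best that can be expected for most variables.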
What, then, is the sample size needed to have 99% confidence that no item
calibration is more than 1 logit away from its stable value?
A two-tailed 99% confidence interval extends 2.6 S.E. on each side of the
estimate. For a 1 logit interval, this S.E. is 1/2.6 logits. This gives a
minimum sample in the range 4*(2.6)^2 < N < 9*(2.6)^2, i.e., 27 < N < 61,
depending on targeting. Thus, a sample of 50 well-targeted examinees is
conservative for obtaining useful, stable estimates. 30 examinees is enough
for well-designed pilot studies. The Table suggests other
ranges. Inflate these sample sizes by 10%-40% if there are major sources of
unmodelled measurement disturbance, such as different testing conditions or
alternative curricula.
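The arithmetic in this paragraph can be sketched in Python as follows. The function name min_sample_range and its parameters are hypothetical, introduced only for this example. The multiplier 2.6 for 99% confidence comes from the text; the optional inflation argument stands in for the 10%-40% adjustment and is an assumption about how one might apply it.

```python
def min_sample_range(stable_within, z, inflation=0.0):
    """Return (well-targeted, off-target) minimum sample sizes.

    stable_within: required stability of the item calibration, in logits
    z:             confidence multiplier (the text uses 2.6 for 99%)
    inflation:     optional fraction, e.g. 0.10 to 0.40, to allow for
                   unmodelled measurement disturbance (hypothetical parameter)
    """
    if stable_within <= 0 or z <= 0 or inflation < 0:
        raise ValueError("stable_within and z must be positive, inflation >= 0")
    se = stable_within / z          # largest acceptable standard error
    low = 4 / se ** 2               # from SE = 2/sqrt(N), best targeting
    high = 9 / se ** 2              # from SE = 3/sqrt(N), poor targeting
    factor = 1 + inflation
    return low * factor, high * factor


low, high = min_sample_range(1.0, 2.6)
print(round(low), round(high))  # 27 61
```

Calling min_sample_range(1.0, 2.6, inflation=0.4) scales both ends by 1.4, giving roughly 38 to 85. A multiplier of 2.0 yields 16 to 36, which matches the range the Table lists for 95% confidence; that reading of the Table is an inference, not something the source states.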
If much larger samples are conveniently available, divide them into smaller,
homogeneous samples of males, females, young, old, etc. in order to check the
stability of item calibrations in different measuring situations.

Item calibrations   Confidence   Minimum sample size range   Size for most
stable within                    (best to poor targeting)    purposes

1 logit             95%          16 -- 36                    30
1 logit             99%          27 -- 61                    50
1/2 logit           95%          64 -- 144                   100
1/2 logit           99%          108 -- 243                  150

John Michael Linacre


Explanatory notes:
1. "For a 1 logit interval this S.E. is 1/2.6 logits."
An estimate's standard error (S.E.) is the modelled standard deviation of the
normal distribution of the observed estimate around its "true" value.
Suppose we want to be 99% confident that the "true" item difficulty is
within 1 logit of its reported estimate. Then the estimate needs to have a
standard error of 1.0/2.6 = 0.385 logits or less.
2. "This gives a minimum sample in the range 4*(2.6)^2 < N < 9*(2.6)^2"
With optimum targeting of a dichotomous test, the modeled probability of
each response is p=0.5. Then the modeled binomial variance = 0.5*0.5 =
the information in a response. Thus N perfectly targeted observations
have information N * 0.5 * 0.5 = N/4. This means that the S.E. of an
estimate produced by N perfectly targeted observations is S.E. = sqrt(4/N).
Similarly, for N extremely off-target observations (for a reasonable
dichotomous test), p=0.13 or p=0.87. For these, the modeled binomial
variance = 0.13*0.87 = the information in a response. N extremely
off-target observations have information N * 0.13 * 0.87, which is
approximately N/9. This means that the S.E. of an estimate produced by N
extremely off-target observations is S.E. = sqrt(9/N).
So, for N observations, the minimum S.E. is sqrt(4/N) and a reasonable
maximum S.E. is sqrt(9/N). The N needed to produce an S.E. of 0.385 logits
or better is therefore given by
sqrt(4/N) = 1/2.6 for the best case (lower limit)
and
sqrt(9/N) = 1/2.6 for the worst reasonable case (upper limit),
i.e., 4*(2.6)^2 < N < 9*(2.6)^2 is the range of minimum values of N to
produce the desired S.E. (or better).
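The information values used in note 2 can be checked with a short Python sketch. The function name response_information is hypothetical; the only formula assumed is the binomial variance p*(1-p) stated in the note.

```python
def response_information(p):
    """Binomial variance p*(1-p): the information in one dichotomous response."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    return p * (1 - p)


# Compare the perfectly targeted case with the extreme off-target cases.
for p in (0.5, 0.13, 0.87):
    info = response_information(p)
    print(f"p={p}: information={info:.4f}, 1/information={1 / info:.2f}")
```

For p=0.5 the reciprocal is exactly 4. For p=0.13 or p=0.87 it is about 8.84, so the 9 in sqrt(9/N) is a convenient rounding that makes the upper limit slightly conservative.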

Sample Size and Item Calibration Stability. Linacre JM. Rasch Measurement
Transactions 1994 7:4 p.328

