
coh30086_ch05_072-085.qxd 12/17/08 07:22 PM Page 72 Confirmed Pages

CHAPTER 5

Reliability
Puzzle 5
Instructions: Identify what is described, answer a question, or fill in the blank to complete this crossword puzzle based on material presented in Chapter 5 of your textbook. Answers presented within clues in capital letters should be considered "free spaces" in the puzzle.

[Crossword puzzle grid]

Across

1. In generalizability theory, an index of the influence that particular facets have on a test score is called a coefficient of _______ .
5. A measure of inter-scorer reliability originally designed for use in instances in which scorers make ratings using nominal scales of measurement is called the KAPPA statistic.
7. A measure of variability equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean.
10. In the true score model, the component of variance attributable to true differences in the ability or trait being measured inherent in an observed score or distribution of scores is referred to as _______ variance.
11. Another name for the standard error of measurement is standard error of a(n) _______ .
14. It's the subject matter of the test items.
15. The extent to which individual test items of a test measure a single construct is referred to as _______ .
18. _______ may be defined as the extent to which measurements differ from occasion to occasion as a function of measurement error.
20. Even-odd reliability or ODD-even reliability, it's all the same. Or is it?
22. M. W. Richardson worked with G. Fredric _______ to develop their own measures for estimating reliability. In fact, M. W. is the R, and G. Fredric is the K in the widely known KR-20 formula.
26. A phenomenon associated with reliability estimates, wherein the variance of either variable in a correlational analysis is inflated by the sampling procedure used and the resulting correlation coefficient tends to be higher as a consequence, is _______ of range.
27. A statistic designed to aid in the determination of how large a difference between two scores should be before the difference should be considered statistically significant is the standard error of the _______ .
28. An estimate of the internal consistency of a test obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once is called _______-half reliability.
29. An estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test is called test-_______ reliability.
30. A test with a time limit, usually of achievement or ability and usually with items of a uniform level of difficulty, is called a(n) _______ test.
31. He, Spearman, and their "prophecy" have been immortalized in texts dealing with statistics and measurement.
32. Also known by names such as "raters" or "observers," they typically enter data, not rulings.
33. The extent to which individual items of a test do not measure a single construct but instead measure different factors is referred to as test _______ .

Down

2. An estimate of parallel-forms reliability or alternate-forms reliability is called a coefficient of _______ .
3. A statistic widely employed in test construction and used to assist in deriving an estimate of reliability, it is equal to the mean of all split-half reliabilities. It is coefficient _______ .
4. An estimate of test-retest reliability obtained during time intervals of six months or longer is called a coefficient of _______ .
6. A(n) _______ test is usually one of achievement or ability with (1) either no time limit, or a time limit that is so long that all testtakers will be able to attempt all items, and (2) some items that are so difficult that no testtaker will be able to obtain a perfect score.
8. The now outdated RULON formula is an equation once used to estimate internal consistency reliability.
9. In the true score model, it's the component of variance attributable to random sources irrelevant to the trait or ability the test purports to measure in an observed score or distribution of scores. It's _______ variance.
10. It's a system of assumptions about measurement that includes the notion that a test score, and even a response to an individual item, is composed of (1) a relatively stable component that actually is what the test or individual item is designed to measure, and (2) relatively unstable components that collectively can be accounted for as error. All of this is better known as generalizability _______ .
12. The standard against which a test or a test score is evaluated; it may take many different forms.
13. Also referred to as content sampling, we refer here to _______ sampling.
16. An abbreviation for item response theory.
17. The range or band of test scores that is likely to contain the "true score" is called the _______ interval.
19. An estimate of the extent to which item sampling and other error have affected scores on two versions of the same test may be referred to as _______ forms reliability.
21. This is a phenomenon associated with reliability estimates wherein the variance of either variable in a correlational analysis is restricted by the sampling procedure used, and the resulting correlation coefficient tends to be lower as a consequence. The phenomenon is called _______ of range.
23. Internal _______ is a reference to how consistently the items of a test measure a single construct obtained from a single administration of a single form of the test and the measurement of the degree of correlation among all of the test items.
24. _______ forms reliability is an estimate of the extent to which item sampling and other error have affected test scores on two versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
25. Also referred to as the standard error of a score, it is the standard error of _______ .
26. A general term to refer to an estimate of the stability of individual items in a test is _______ consistency reliability.

THE SCIENCE OF PSYCHOLOGICAL MEASUREMENT

EXERCISE 5-1
MOVIES AND MEASUREMENT

[Photo: A perfect "10"?]

OBJECTIVE
To think about the concept of the reliability of evaluations in an everyday context

BACKGROUND
Dudley Moore (above right) rates Bo Derek as a "perfect 10" in the classic film, 10. This rating is presumably based on subjective criteria related to beauty and related factors. Such ratings can provide a convenient point of departure for discussing psychometric issues such as reliability.

YOUR TASK
Write a brief essay entitled "The Reliability of Interpersonal Ratings" in which you make reference to Dudley Moore and Bo Derek in the film 10. Discuss how the "test-retest reliability" of such ratings might change over time as a function of various events.

EXERCISE 5-2
THE CONCEPT OF RELIABILITY

OBJECTIVE
To enhance understanding of the concepts of reliability and error variance

BACKGROUND
Broadly speaking, the concept of reliability as used in the context of psychological testing refers to the attribute of consistency in measurement. According to what is referred to as the true score model or theory, a score on a test reflects not only the "true" amount of whatever it is that is being measured (such as the true amount of an ability or the true amount of a particular personality trait) but also other factors, including chance and other influences (such as noise, a troublesome pen—virtually any random, irrelevant influence on the testtaker's performance). A reliability coefficient is an index of reliability—one that expresses the ratio between the "true" score variance on a test and the total variance. We place the word "true" in quotes because, as Stanley (1971, p. 361) so aptly put it, a true score "is not the ultimate fact in the book of the recording angel." Rather, a true score on a test is thought of as the (hypothetical) average of all the observed test scores that would be obtained were an individual to take the test over and over again an infinite number of times. More technically, a true score is presumed to be the remaining part of the observed score once the observed score is stripped of the contribution of random error. Recall that

X = T + E

where X represents an observed score, T represents a true score, and E represents an error score (a score due to random, irrelevant influences on the test). Now let's focus on the squared standard deviations—or variances (symbolized by lowercase sigmas)—of observed scores, true scores, and error scores. The formula that follows,

σ² = σ_tr² + σ_e²

indicates that the total variance (σ²) in an observed score (or a distribution of observed scores) is equal to the sum of the true variance (σ_tr²) and the error variance (σ_e²).

The reliability of a test—denoted below by the symbol r_xx to indicate that the same ability or trait (x) is being measured twice—is an expression of the ratio of true to observed variance:

r_xx = σ_tr² / σ²

If all of the observed scores in a distribution were entirely free of error—and in essence equal to true scores—the calculated value of r_xx would be 1. If all of the observed scores in a distribution contained equal parts error and "true" ability (or traits, or whatever), the calculated value of r_xx would be .5. The lower range of a reliability coefficient is .00, and a coefficient of .00 would be indicative of a total lack of reliability; stated another way, such a quotient would be indicative of total error variance (and a total absence of any variance due to whatever it was that the test was supposed to have been measuring).

How is a reliability coefficient calculated? While the ratio of true to observed variance serves us well in theory, it tends

not to be very useful in everyday practice. For most data, we will never know what the "true" variance is, and so calculating a reliability coefficient is more complicated than the simple construction of the ratio. The reliability of a test is typically estimated using the appropriate method from any of a number of existing methods. Before getting to specifics, however, let's go back to the expression indicating that the observed variance is equal to the true variance plus the error variance,

σ² = σ_tr² + σ_e²

and rewrite that expression as follows,

σ_tr² = σ² − σ_e²

and then substitute the resulting terms into the expression of the ratio of true to observed variances:

r_xx = (σ² − σ_e²) / σ²

Solving for r_xx, we derive the following expression of test reliability:

r_xx = 1 − σ_e² / σ²

In practice, an estimate of reliability as reflected in a reliability coefficient is calculated by means of a coefficient of correlation such as the Pearson r or Spearman's rho—whichever is the appropriate statistic for the data. For example, if the reliability coefficient to be calculated is of the test-retest variety, you may wish to label scores from one administration of the test as the X variable and scores from the second administration of the test as the Y variable; the Pearson r would then be used (provided all of the assumptions inherent in its use were met) to calculate the correlation coefficient—then more appropriately referred to as a "coefficient of reliability." Similarly, if the reliability coefficient to be calculated is a measure of interscorer reliability, you may wish to label Judge 1's scores as the X variable and Judge 2's scores as the Y variable and then employ either the formula for the Pearson r or the Spearman rho (the latter being the more appropriate statistic for ranked data). An exception to this general rule is the case where a measure of internal consistency is required; here, alternative statistics to r (such as coefficient alpha) may be more appropriate.

YOUR TASK
Answer these three questions in detail:
1. Is it possible to develop a test that will be totally free of error variance? Explain why or why not.
2. As an academic exercise, what if you wished to develop an ability-type test that in no way reflected the testtaker's ability? In other words, contrary to the question above, in which you were asked whether it would be possible to develop a totally error-free test, here you are being asked if it is possible to develop a test that would reflect nothing but error.
3. Describe the role the concept of correlation plays in the concept of reliability.

EXERCISE 5-3
TEST-RETEST AND INTERSCORER RELIABILITY

OBJECTIVE
To enhance understanding of and provide practical experience with the computation of test-retest reliability and interscorer reliability

BACKGROUND
As part of Exercise 4-5, you were made privy to final examination score data for a class from a new home-study trade school of impersonation. Let's now suppose that one morning the chancellor of that school wakes up with a severe headache, terrible cramps, and a sudden interest in the area of psychometrics. Given this newfound interest, the chancellor insists that all of the school's ten students must re-take the same (take-home) examination—this so that a coefficient of test-retest reliability can be calculated. Let's further suppose that only a week or so has elapsed since each of the students first took the (not so) final examination. All of the students comply, and the data for the first administration of the final examination as well as its re-administration are presented below:

Student      Final Exam Score    Retest Score
Malcolm      98                  84
Heywood      92                  97
Mervin       45                  63
Zeke         80                  91
Sam          76                  87
Macy         57                  92
Elvis II     61                  98
Jed          88                  69
Jeb          70                  70
Leroy        90                  75

YOUR TASK
If you liked the exercise in Chapter 4 in which you calculated what in essence was an alternate forms reliability

coefficient, you should also like your task here: calculating a test-retest coefficient of correlation.

1. a. Create a scatterplot of these data. Simply by "eyeballing" the obtained scatterplot, what would you say about the test-retest reliability of the final examination the school is using?
   b. For the purpose of this illustration, let's assume that all of the assumptions inherent in the use of a Pearson r are applicable. Now use r to calculate a test-retest reliability coefficient. What percentage of the observed variance is attributable to "true" differences in ability on the part of the testtakers, and what percentage of the observed variance is error variance? What are the possible sources of error variance?
2. Let's say that instead of final examination score and re-test data, the scores listed represented the ratings of two former America's Got Talent judges with respect to criteria like "general ability to impersonate Elvis Presley," "accent," and "nonoriginality." Relabeling the data for the final examination as "Judge 1's Ratings," and relabeling the data for the retest as "Judge 2's Ratings," rank-order the data and calculate a coefficient of interscorer reliability using Spearman's rho. To help get you started, a table you can use to convert the judges' ratings to rankings follows. After you've computed the Spearman rho, answer these questions: What is the calculated coefficient of interscorer reliability, and what does it mean?

Student     Judge 1 Rating   Judge 1 Ranking   Judge 2 Rating   Judge 2 Ranking
Malcolm     98               ______            84               ______
Heywood     92               ______            97               ______
Mervin      45               ______            63               ______
Zeke        80               ______            91               ______
Sam         76               ______            87               ______
Macy        57               ______            92               ______
Elvis II    61               ______            98               ______
Jed         88               ______            69               ______
Jeb         70               ______            70               ______
Leroy       90               ______            75               ______

[Worksheet: A Scatterplot of Test and Retest Scores]

[Worksheet: A Scatterplot of the Ratings of Judge 1 and Judge 2]

EXERCISE 5-4
USING THE SPEARMAN-BROWN FORMULA

OBJECTIVE
To enhance understanding of and provide firsthand experience with the Spearman-Brown formula

BACKGROUND
"What is the nature of the correlation between one half of a test and the other?" "What will be the estimated reliability of the test if I shorten the test by a given number of items?" "What will be the estimated reliability of the test if I lengthen the test by a given number of items?" In answer to these and related types of questions, the appropriate tool is the Spearman-Brown formula.

Reduction in test size for the purpose of reducing test administration time is a common practice in situations where the test administrator may have only a limited amount of time with the testtaker. In the version of the Spearman-Brown formula used to estimate the effect of reducing the length of a test, r_sb is the Spearman-Brown estimate of reliability, n represents the fraction by which the test length is being reduced, and r_xy represents the reliability coefficient that exists prior to the abbreviation of the test:

r_sb = n·r_xy / (1 + (n − 1)·r_xy)

Let's assume that a test user (or developer) wishes to reduce a test from 150 to 100 items; in this case, n would be equal to the number of items in the revised version (100 items) divided by the number of items in the original version (150):

n = 100/150 = .67

YOUR TASK
1. Assuming the original 150-item test had a measured reliability (r_xy) of .89, use the Spearman-Brown formula to determine the reliability of the shortened test.
2. Now, how about some firsthand experience in using the Spearman-Brown formula to determine the number of items that would be needed in order to attain a desired level of reliability? Assume for the purpose of this example that the reliability coefficient (r_xx) of an existing test is .60 and that the desired reliability coefficient is .80. In the expression of the Spearman-Brown formula below, n is equal to the factor that the number of items in the test would have to be multiplied by in order to increase the total number of items in the test to the total number needed for a reliability coefficient at the desired level, r′ is the desired reliability, and r_xx is the reliability of the existing test:

n = r′(1 − r_xx) / (r_xx(1 − r′))

Thus, for example, if n were calculated to be 3, a 50-item test would have to be increased by a factor of 3 (for a total of 150 items) in order for the desired level of reliability to have been reached. Try one example on your own. Assume now, for the purpose of example, that a 100-item test has an r_xx = .60. In order to increase the reliability of this test to .80, how many items would be necessary?
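For readers who want to check their Exercise 5-4 answers, both versions of the Spearman-Brown formula above can be sketched in a few lines of plain Python. This is a minimal sketch; the function names are our own labels, not the textbook's.

```python
# Spearman-Brown sketches for Exercise 5-4 (the numbers below are the
# exercise's own; the function names are our labels, not the textbook's).

def sb_adjusted_reliability(r_xy, n):
    """r_sb for a test shortened (n < 1) or lengthened (n > 1) by factor n."""
    return (n * r_xy) / (1 + (n - 1) * r_xy)

def sb_length_factor(r_existing, r_desired):
    """Factor n by which the number of items must be multiplied
    to move from r_existing to r_desired."""
    return (r_desired * (1 - r_existing)) / (r_existing * (1 - r_desired))

# Task 1: a 150-item test with r_xy = .89 cut to 100 items (n = 100/150)
print(round(sb_adjusted_reliability(0.89, 100 / 150), 2))   # ≈ .84

# Task 2: raising an existing test from r_xx = .60 to a desired .80
print(round(sb_length_factor(0.60, 0.80), 2))               # ≈ 2.67
```

Multiplying an existing item count by the returned factor gives the approximate test length needed for the desired reliability.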
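Similarly, the two hand computations called for in Exercise 5-3 above can be verified with a short script once the scatterplots and rankings are done by hand. This is a minimal sketch in plain Python using the exercise's own ten score pairs; the helper functions are ours, not the textbook's, and the rank-difference form of rho is used because no ratings are tied.

```python
# Sketch for Exercise 5-3: test-retest (Pearson r) and interscorer
# (Spearman rho) reliability. Data are the exercise's; helpers are ours.
from math import sqrt

final  = [98, 92, 45, 80, 76, 57, 61, 88, 70, 90]   # first administration
retest = [84, 97, 63, 91, 87, 92, 98, 69, 70, 75]   # re-administration

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(scores):
    # Rank 1 = highest score; safe here because no ratings are tied.
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

def spearman_rho(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

r = pearson_r(final, retest)        # task 1b: test-retest reliability
rho = spearman_rho(final, retest)   # task 2: same data relabeled as judges' ratings
print(f"test-retest r   = {r:.2f}")
print(f"interscorer rho = {rho:.2f}")
```

Per the true score model described in Exercise 5-2, the obtained r can also be read as the proportion of observed variance attributable to "true" differences, with 1 − r the proportion attributable to error.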
EXERCISE 5-5
UNDERSTANDING INTERNAL CONSISTENCY RELIABILITY

OBJECTIVE
To enhance understanding of the psychometric concept of internal consistency reliability as well as methods used to estimate it

BACKGROUND
This exercise is designed to stimulate thought about the meaning of an estimate of internal consistency reliability. Your instructor may assign one, all, or only some of the parts of this exercise.

YOUR TASK
1. In your own words, write a brief (about a paragraph or two) essay entitled "The Psychometric Concept of Internal Consistency Reliability."
2. Using your school library, locate and read three primary sources having to do with methods of obtaining an estimate of internal consistency. On the basis of what you have learned from these articles, rewrite the essay you wrote in Part 1, incorporating the new information. Your new essay should be no more than two pages.
3. A number of different methods may be used to obtain an estimate of internal consistency reliability. In a sentence or two, describe when each of the following would be appropriate:
   a. the Spearman-Brown formula
   b. coefficient alpha
   c. KR-20
4. Each of the following statements is true. In one or two sentences, explain why this is so.
   a. An internal consistency reliability estimate is typically achieved through only one test session.
   b. An estimate of internal consistency reliability is inappropriate for heterogeneous tests.
   c. An estimate of internal consistency reliability is inappropriate for speeded tests.
   d. When estimating internal consistency reliability, the size of the obtained reliability coefficient depends not only on the internal consistency of the test but also on the number of test items.

EXERCISE 5-6
FIGURE THIS

OBJECTIVE
Obtain firsthand computational experience figuring out problems related to material presented in the chapter

BACKGROUND
Use your knowledge of material presented in Chapter 3 in your textbook to tackle Your Task in what follows.

YOUR TASK
1. The school psychologist administered an IQ test with a mean of 100 and a standard deviation of 15 to six children. Their scores were as follows: Sam 85, Jean 100, Byron 126, LaKeisha 115, Hector 68, Hai 145. The reliability coefficient is .85 for this test. Calculate the following:
   a. Standard error of measurement for the test
   b. 68% confidence interval for Sam and Jean
   c. 95% confidence interval for Byron and LaKeisha
   d. 99% confidence interval for Hector and Hai
2. Dexter took an IQ test and obtained a score of 105. He also took a math teacher achievement test and obtained a score of 140. Both tests have a mean of 100 and standard deviation of 15. The reliability coefficient for the IQ test is .82 and for the math teacher achievement test is .91. Calculate the standard error of difference for Dexter's two test scores.
3. LaRonta also took the math teacher achievement test and obtained a score of 145. Calculate the standard error of difference and compare LaRonta and Dexter's performance. Who would you want to teach you statistics and why?

EXERCISE 5-7
STANDARDS FOR TESTS: THEN AND NOW

OBJECTIVE
To obtain a historical perspective on desirable criteria for standardized tests by comparing a 1920s-era call for such criteria with the current edition of Standards for Educational and Psychological Testing

BACKGROUND
Over a half-century ago, measurement expert Giles M. Ruch proposed standards for tests that in many ways anticipated

the current version of Standards for Educational and Psychological Testing. The original text of Ruch's (1925) article is reprinted below.

MINIMUM ESSENTIALS IN REPORTING DATA ON STANDARD TESTS

G. M. Ruch, State University of Iowa

With the increasing number of educational and mental tests, an already bewildering situation is daily becoming more aggravated. The writer refers particularly to the task of the superintendent, director of research, and the college teacher of tests and measurements, who is confronted with the problem of selecting and recommending the "best" test to use. Even if the cards were all on the table, the selection of the "best" test would present grave difficulties. These decisions must be made, and are being made daily, but it is an open question whether any living being possesses the exact knowledge required to make such decisions in anything approaching a scientific manner. Where does the difficulty lie?

The blame, to use a harsh term, lies primarily with the test authors, secondarily with the publishers of tests, and finally, to some extent, with the users of tests—the relative culpabilities showing wide individual differences within these groups.

The logic of the situation, viewed broadly, would seem to demand that two things be done at once: first, that a set of working criteria for test construction be established—and much progress has been made on this point; and second, that test authors, test publishers, test users, and test investigators adopt some fairly uniform practices in reporting on tests, at least to the extent of a few minimum essentials. Except in isolated cases, this has not been the regular practice. Although the second of these points is the primary consideration of this paper, a few statements about the first¹ will help to clarify matters.

Criteria for Evaluating Tests and Measurements

Validity—the general worth or merit of the test. A description of the validation of a test may well include such facts as the following:

1. The criterion against which the test was validated: analysis of courses of study, analysis of textbooks, recommendations of national educational committees, experimental studies (for example, word counts, analysis of social needs), age or grade rise in percent of successes, judgments of "competent" persons, correlation with an outside or independent criterion, or merely "armchair analysis" and "insight."
2. Statement of the exact details of all experimental work leading to the final forms of the test.
3. Statement of the diagnostic powers of the test, if any.
4. Statement of the exact field or range of educational or mental functions for which the test is claimed to be valid.
5. Statement whether the test is adequate for class measurement or pupil measurement. (This is largely a matter of reliability.)
6. Statement relative to equality of forms and how guaranteed. The same applies to equality of variabilities, namely, equal standard deviations for same groups.
7. Statement of the degree of correlation of the test with the criterion, or with school success, age, etc.; or statement of the agreement of test scores with the attainments of groups of individuals known to be widely spaced upon a scale of abilities.
8. Statement of the functions that the test can legitimately claim to serve; for example, prognosis, sectioning of classes, assignment of official grades, determination of passing or failing, efficiency of instruction.

These should be accompanied by summaries of the experimental evidence.

Reliability—the accuracy with which the test measures. This is independent of its validity, although high validity demands high reliability. Such facts as the following are absolutely essential:

1. Reliability coefficients, which are usually to be determined by correlation of similar forms. These are, however, practically meaningless unless accompanied by statements of (1) the range of talent involved in the determination—best stated in terms of the standard deviation of the test scores; (2) the population involved in the determination of the r's; (3) the age or grade groups involved and any evidence leading to judgments of the amounts of selection or systematic tendencies, or other factors militating against the representativeness of the sampling; (4) the order of giving the forms of the test; and (5) the mean scores on each form. These facts should be given in sufficient detail to permit a second investigator to reproduce the essential conditions of the experimentation at will.
2. Certain derived measures, which are relatively independent of (1) the range of talent employed and (2) the arbitrariness of the test units.

Ease of administration and scoring. The following facts might be given:

¹ Kelley, T. L. "The Reliability of Test Scores," Journal of Educational Research, 3:370–79, May, 1921.
Monroe, W. S. The Theory of Educational Measurements. Boston, Houghton Mifflin Company, 1923, pp. 182–231.
McCall, W. A. How to Measure in Education. New York, Macmillan Company, 1922, pp. 195–410.

1. Degree of objectivity of scoring. This influences very markedly the reliability and hence the validity of the test.
2. Time for giving. This is of minor importance without supplementation by other facts, the popular opinion to the contrary, notwithstanding. The proper criteria should be: Validity per unit testing time, reliability per unit testing time, or some similar point of view.
3. Time needed for scoring. This is best stated in terms of the average number of papers scored per unit of time. That this is of secondary importance is shown by the fact that comparatively few standard tests exceed the ordinary written examination in the time or labor of scoring.
4. Simplicity of directions for pupil and examiner.

Norms. This should include:

1. Kinds of norms available (age, grade, percentile, mental age, T scores, etc.)
2. Statement of derivation of norms. This should cover, specifically, facts like those listed under "Reliability" above. The important thing is the representativeness of the sampling, not the size. Norms on one hundred thousand cases are not necessarily as accurate as those based on ten thousand, or even one thousand cases. The validity of a norm is not determined by its probable error but by the principles and laws of sampling observed or violated. Norms, at best, are doubtful devices;² and blind faith in numbers approaches "hocus pocus" at times. The most important thing that a test can do is to place the pupil in accurate rank positions along a scale of true ability. The common practice of pooling test-score tabulations voluntarily returned to the author, without a scrupulous program for the elimination of the almost inevitable selection effects incident to this procedure, is particularly to be regretted.

Cost. This has little or no theoretical interest but is a very practical consideration. The cost per pupil is practically valueless unless other facts are weighed. Validity per unit cost is the criterion to apply. A test may be a poor investment at one cent per pupil, while a second test would be cheap at ten cents per pupil. Costs of tests vary with (1) the cost of experimental work (to the writer's knowledge, the variation on this point ranges from less than $100 to at least $10,000), (2) quality of printing, (3) length of test—a very important factor in validity and reliability, and (4) profits to publisher and author.

With respect to all of the before-mentioned desiderata, no exact procedure can or need be recommended. It is merely the "spirit of the game" that is fundamental. There is, however, one very important question to be asked, Where should the above data be published? The best place, theoretically and practically, is in the manual of directions accompanying the test. This can well be expanded to four, ten, or even one hundred pages. The next best procedure is probably that of publishing abstracts of the complete description of the test in the manual of directions, reserving the details for articles in the standard journals or for publication in a separate monograph. The important thing is that full accounts be made accessible to the critical user or student of tests. To the user of tests should be extended the privilege of choice with open eyes, namely, with the "cards all on the table," to repeat an earlier statement.

If space permitted, the writer could at least entertain the reader by quoting numerous replies received from authors of secondary-school tests in response to an appeal for such data as have been outlined. Parenthetically it might be stated that fully 75 percent of test authors had made no systematic or critical study of their tests, and not a few did not comprehend the conventional test terminology. One responded to a question about the reliability of his test: "This is not an intelligence test." The correspondent must have been amazed at the writer's naïveté in expecting an educational test to possess reliability at all!

The Reporting of Reliabilities of Test Scores

The remainder of this discussion will be devoted to a single one of the criteria listed for the evaluation of tests, namely, reliability. Attention is directed to this topic partly because of its importance and partly because the data available on it are exceedingly meager. There has been a wide variety of practices in reporting test reliabilities—when, indeed, such have been reported at all—as follows:

1. The reliability coefficient,³ r₁₂
2. The probable error of estimate,³ P.E.(1·2) = 0.6745 σ₁ √(1 − r₁₂²)
3. The probable error of measurement,⁴ P.E.(M) = 0.6745 σ √(1 − r₁₂), where σ = (σ₁ + σ₂)/2
4. The index of reliability,⁴ r(1t) = √r₁₂

² See Chapman, J. C. "Some Elementary Statistical Considerations in Educational Measurements," Journal of Educational Research, 4:212–20, October, 1921, for an excellent treatment of this question, and Manual of Directions, Stanford Achievement Test, Revised Edition; particularly the Appendix.
³ See any standard textbook on statistics.
⁴ Monroe, W. S. The Theory of Educational Measurements. Boston, Houghton Mifflin Company, 1923, pp. 206 ff.
coh30086_ch05_072-085.qxd 12/17/08 07:22 PM Page 82 Confirmed Pages

80 THE SCIENCE OF PSYCHOLOGICAL MEASUREMENT

5. P.E.M / M, where the P.E.M is given by formula 3 above,5 and M is the mean of the distribution of scores, and, presumably, equals (M1 + M2) / 2
6. The probable error of estimate of a true score by means of a single score of the same function,6 P.E.∞·1 = 0.6745 σ1 √(r12 − r12²)

Examination of the six procedures just listed will show at once that every one, except the first, in part, calls for two fundamental facts, namely, (1) the reliability coefficient; and (2) the standard deviations of the two distributions. To these must be added, at least, (3) the population on which r12 is based in order to calculate the probable error of r.

The following are most desirable and probably should always be reported: (4) a verbal description of the kind of talent involved in the calculation of r, for example, age or grade group, kind of school, and possible selective factors militating against the representativeness of the sampling; (5) the mean scores of the two distributions leading to r; (6) the order of giving the tests7 and the conditions under which the testing was carried out.

It is greatly to be regretted that one further recommendation is not always practicable, namely, (7) the publication of the scatter diagrams for all r's reported—this for its bearing on the lack of rectilinearity of the regressions and the possibility of faulty grouping. Both of these factors lower greatly the obtained r in comparison with the truth of the relationship.

Returning to our statements of the various methods of treating the reliability of test scores, it will be seen at once that our list of needed facts more than covers the needs of any or all of the six procedures; in fact, points (1) and (2) alone will suffice for the purely computational procedures. Granting this, the task of estimating the reliability of a test still would be a bit of a "leap in the dark" without the subsidiary data outlined in points (2) to (7), inclusive.

It is rather beyond the scope of this paper, and beyond the ability of the writer as well, to demonstrate the absolute superiority of any one of the six proposals. However, certain of these have grave defects that must be made apparent in the interests of their abandonment for the purpose. The six methods will be commented on briefly in turn.

What Is the Best Method of Stating Test Reliabilities?

The first method, the reliability coefficient, can be held to be almost valueless, per se, for two reasons: (1) it is a function of the range of talent (σ) and hence has no general stability, and (2) alone, it tells nothing about the behavior of the individual score. For Examiner A to report r12 as 0.85 for Test X and Examiner B to report 0.64 for Test Y does not at all imply that Test X is more reliable than Test Y. It would depend upon the range of talent employed. Assume the following facts for the case:

                        Test X                  Test Y
r12                     0.85                    0.64
Group tested            500 pupils              500 pupils
                        (Grades iv–xii)         (Grade vi only)
Standard deviation      40.4                    10.1

It may readily be shown that Test Y, if applied to the group to which Test X was given, would show a reliability coefficient in the neighborhood of 0.98.8

The needed measure of reliability must be independent, in large part, of the influence of range of talent. The reliability coefficient, as we have seen, is not. The practice of reporting r's unsupported by other facts should be discontinued.

The second proposed measure is the probable error of estimate,

P.E.1·2 = 0.6745 σ1 √(1 − r12²).

This, however, does not serve the purpose. It helps little, or not at all, to obtain an estimate of the score in a second form of a test, when it is a simple matter to obtain a much better second score by actually giving the second test. An estimate of a true score is what is needed. The probable error of estimate does have a real value9 as a measure of alienation from perfect prediction.

5 Monroe, W. S. A Critical Study of Certain Silent Reading Tests. Urbana, University of Illinois, 1922, pp. 32 ff. (Bureau of Educational Research, College of Education, University of Illinois Bulletin, Vol. XIX, Series No. 8)
6 Kelley, T. L. Statistical Method. New York, Macmillan Company, 1923, pp. 212 ff.
7 Because the order of giving the tests must necessarily be different for the determination of reliability coefficients than for the investigation of equivalence of difficulties from form to form. For the first purpose, all pupils should take the tests in the same order, e.g., Form A followed by Form B; for the second purpose, one half of the group should take Form A first and one half Form B first. In the first case we want the practice effects to be systematic; in the second, they should tend to be neutralized.
8 See Kelley, T. L. Statistical Method. New York, Macmillan Company, 1923, p. 222, formula 178: σ1/Σ1 = √(1 − R)/√(1 − r), or, 10.1/40.4 = 1/4 = √(1 − R)/√(1 − 0.64), whence R = 0.98−. It is assumed that both tests are scaled to the same units.
9 Especially in the form recommended by Kelley, i.e., k = √(1 − r12²).
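Ruch's objection to the bare coefficient, and the arithmetic of footnote 8, can be checked directly. The following Python sketch is an editorial illustration, not part of the 1925 article; the function names are invented, sigma stands for σ, and the numbers are those of the Test X / Test Y example above:

```python
import math

def pe_estimate(sigma1, r12):
    # Probable error of estimate: P.E. = 0.6745 * sigma1 * sqrt(1 - r12**2)
    return 0.6745 * sigma1 * math.sqrt(1 - r12 ** 2)

def reliability_in_wider_range(r, sigma_narrow, sigma_wide):
    # Kelley's relation (Statistical Method, formula 178), solved for R:
    #   sigma_narrow / sigma_wide = sqrt(1 - R) / sqrt(1 - r)
    # R is the coefficient the same test would show in the wider range of talent.
    ratio = sigma_narrow / sigma_wide
    return 1 - ratio ** 2 * (1 - r)

# Test Y (r12 = 0.64, sigma = 10.1) projected onto the range of talent of
# Test X's group (sigma = 40.4), as in footnote 8:
R = reliability_in_wider_range(0.64, 10.1, 40.4)
print(round(R, 4))  # 0.9775, i.e., "in the neighborhood of 0.98"
```

Narrowing the range of talent drives any r12 down, and widening it drives r12 up, which is exactly why the article insists that r never be reported without the standard deviations of the groups on which it was computed.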
The third proposal (by Monroe, apparently) of a probable error of a test score is a development of the same formula. It has been shown that the correlation of obtained scores with true scores of the same function is

r1·∞ = √r12,

hence the probable error of estimate of true scores from obtained scores is

P.E.1·∞ = 0.6745 σ1 √(1 − r12)

by substitution in the formula for the probable error of estimate. This, however, is not the needed formula but the reverse, for example, the probable error of estimate of a true score from an obtained score. Such a formula is number 6 in the list, namely,

P.E.∞·1 = 0.6745 σ1 √(r12 − r12²).

The fourth proposal, namely, the ratio of P.E.M to M, in the opinion of the writer, has no utility. Further comment is omitted here, because it has received an able criticism10 since the first draft of this paper was written.

The fifth formula,

r1t = √r12,

is useful in evaluating certain correlation situations but would seem to have no particular reference to the problem of the reliability of test scores.

The last formula is probably the only one of the list that is entirely adequate to the problem at hand. It possesses the merits of being independent of the range of talent and of allowing for regression effects in test scores. The formula,

P.E.∞·1 = 0.6745 σ1 √(r12 − r12²),

gives us the probable error of estimating true scores from obtained or fallible scores, when the true scores are estimated by the formula,11

X∞·1 = r12 X1 + (1 − r12) M1.

X∞·1 may be regarded as the best estimate possible of a true score, such as would be obtained by the average of an infinite number of obtained (X1) scores. It is the estimated true score and its probable error that are needed and not the reverse as in the case of the third proposal above. Kelley suggests further the use of the

P.E.∞·1 / σ1

ratio12 as a measure of the improvement due to the use of the test.

In conclusion, it will readily be seen that the real need in reporting reliability data on test scores is the publication of a minimum of four things:

1. The reliability coefficient
2. The standard deviations of the two distributions
3. The population involved in the calculation of the r's
4. The means of the two distributions

These four data will permit the treatment of reliability by any of the methods proposed to date, and in addition, estimates of true scores, and prophecies of changes in reliability with changes in the range of talent. The estimate of regression effects is implied as another possible procedure; and, in cases of intercorrelations of test scores, the application of correction formulas for attenuation is made possible in evaluating true relationships of the variables within the validity of the assumptions of such correction formulas.

Non-publication of such data as we have recommended is really a violation of the ethical codes of scientific procedures and not to be condoned by virtue of the fact that users of tests generally will not understand the technicalities. Rather, the teacher of tests and measurements should attempt to educate outgoing students to demand such confidences on the part of test authors. The alternative will certainly often be the refusal to recommend to school officials tests and scales upon which no critical facts are at hand. Such tests are not necessarily undependable, but the careful worker will not wish to assume responsibilities of proof, which in all fairness rest upon the author of the test. The test buyer is surely entitled to the same protection as the buyer of food products, namely, the true ingredients printed on the outside of each package.

This statement alone is offered as a sufficient justification for presenting facts that are in no sense original with the writer.

10 Franzen, F. R. "Statistical Issues," Journal of Educational Psychology, 15:367–82, September, 1924.
11 Kelley, T. L. Op. cit., p. 214, formula 168.
12 Actually, σ∞·1/σ1 appears in Kelley's recommendation, but either ratio leads to similar interpretations (Loc. cit., p. 215). The σ1 cancels in numerator and denominator, leaving the expression under the radical, namely, √(r12 − r12²), as the important measure.

YOUR TASK

Compare the views of Ruch as expressed in the previous article with the contents of the current Standards for Educational and Psychological Testing (probably available in the reference section of your university library). In what ways did Ruch (1925) anticipate the Standards? In what ways could Ruch's views be informed by the Standards?
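Kelley's estimated true score and its probable error, the two quantities Ruch says are actually needed, can be sketched the same way. The Python fragment below is an editorial illustration only; the function names and the sample figures (score 120, mean 100, sigma 15) are hypothetical, not from Kelley or Ruch:

```python
import math

def estimated_true_score(x1, r12, m1):
    # Kelley's formula 168: X(inf.1) = r12 * X1 + (1 - r12) * M1.
    # The obtained score X1 is pulled back toward the group mean M1
    # in proportion to the test's unreliability.
    return r12 * x1 + (1 - r12) * m1

def pe_true_score(sigma1, r12):
    # Probable error of the estimated true score:
    #   P.E.(inf.1) = 0.6745 * sigma1 * sqrt(r12 - r12**2)
    return 0.6745 * sigma1 * math.sqrt(r12 - r12 ** 2)

# A pupil obtains 120 on a test with r12 = 0.85 in a group with
# mean 100 and standard deviation 15:
best_estimate = estimated_true_score(120, 0.85, 100)  # 117.0
margin = pe_true_score(15, 0.85)                      # about 3.6 score points
```

The regression toward the mean built into the estimate is precisely the behavior of the individual score that, as the article argues, the reliability coefficient alone cannot convey.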
EXERCISE 5-8
THE RELIABILITY OF THE BAYLEY-III

OBJECTIVE

To consult a primary source—in this case a specific test manual—as well as other sources, and update an essay based on updated information regarding a specific test

BACKGROUND

The Bayley Scales of Infant Development (BSID; Bayley, 1969) were revised in 1993, and this revision of the test was referred to as the Bayley Scales of Infant Development, Second Edition (or BSID-II). What follows below is a brief essay on the reliability of the BSID-II.

THE RELIABILITY OF THE BSID-II

The Bayley Scales of Infant Development (BSID; Bayley, 1969) were designed to sample for measurement aspects of the mental, motor, and behavioral development of infants. Bayley scores tended to drift upward over the course of some two decades of use (Schuler et al., 2003), and the test was revised in 1993.

Much like the original test, the Bayley Scales of Infant Development, Second Edition (BSID-II; Bayley, 1993), was designed to assess the developmental level of children between 1 month and 3½ years old. It is used primarily to help identify children who are developing slowly and might benefit from cognitive intervention. The BSID-II includes three scales. Items on the Motor Scale focus on the control and skill employed in bodily movements. Items on the Mental Scale focus on cognitive abilities. The Behavior Rating Scale assesses behavior problems, such as lack of attention.

Is the BSID-II a reliable measure? Because the Mental, Motor, and Behavior Rating Scales are each expected to measure a homogeneous set of abilities, internal consistency reliability for each of these scales is an appropriate measure of reliability. Bayley (1993) reported coefficient alphas ranging from .78 to .93 for the Mental Scale (variations exist across the age groups), .75 to .91 for the Motor Scale, and .64 to .92 for the Behavior Rating Scale. From these reliability studies, Bayley (1993) concluded that the BSID-II is internally consistent.

Consider, however, an issue unique to instruments used in assessing infants. We know that cognitive development during the first months and years of life is uneven and fast. Children often grow in spurts, changing dramatically over a few days (Hetherington & Parke, 1993). The child tested just before and again just after a developmental advance may perform very differently on the BSID-II at the two testings. In such cases, a change in test score would not be the result of error in the test itself or in test administration. Instead, such changes in the test score could reflect an actual change in the child's skills. Of course, not all differences between the child's test performance at two test administrations need to result from changes in skills. The challenge in gauging the test-retest reliability of the BSID-II is to do it in such a way that it is not spuriously lowered by the testtaker's actual developmental changes between testings.

Bayley's solution to this dilemma entailed examining test-retest reliability over short periods of time. The median interval between testings was just four days. Correlations between the results of the two testing sessions were strong for both the Mental (.83 to .91) and the Motor (.77 to .79) Scales. The Behavior Rating Scale demonstrated lower test-retest reliability: .48 to .70 at 1 month of age, .57 to .90 at 12 months of age, and .60 to .71 at 24 to 36 months of age (Bayley, 1993).

Inter-scorer reliability is an important concern for the BSID-II because many items require judgment on the part of the examiner. The test manual provides clear criteria for scoring the infant's performance. However, by their nature, many of the tasks involve some subjectivity in scoring. For example, one of the Motor Scale items is "Keeps hands open . . . Scoring: Give credit if the child holds his hands open most of the time when he is free to follow his own interests" (Bayley, 1993, p. 147). Examiner error on this item can arise from a variety of sources. Different examiners may note the position of the child's hands at different times. Examiners may define differently when the child is "free to follow his own interests." And examiners may disagree about what constitutes "most of the time."

An alternate or parallel form of the BSID-II does not exist, so alternate-forms reliability cannot be assessed. An alternate form of the test would be useful, especially in cases in which the examiner makes a mistake in administering the first version of it. Still, the creation of an alternate form of this test would almost surely entail a great investment of time, money, and effort. If you were the test's publisher, would you make that investment? In considering the answer to that question, don't forget that the ability level of the testtaker is changing rapidly.

Nellis and Gridley (1994) noted that a primary goal of revision was to strengthen the test psychometrically. Based on the data provided in the test manual, Nellis and Gridley concluded that this goal was accomplished: The BSID-II does seem to be more reliable than the original Bayley Scales. However, there are still some important weaknesses. For example, the manual focuses on the psychometric quality of the BSID-II as administered to children without significant developmental problems. Whether the same levels of reliability would be obtained with children who are developmentally delayed is unknown. Perhaps a
more intriguing unknown is the question of why there was drift in the scores upward over the course of about two decades in which the first edition was in use. Will this phenomenon of upward score-drift repeat itself in two decades or so of use of the second edition? Time will tell.

YOUR TASK

In 2005, the third edition of the Bayley Scales (otherwise known as the Bayley-III) was published. Using the test manual for the Bayley-III as well as other published sources, update the discussion of the second edition test with your findings regarding the third edition. Title your essay "The Reliability of the Bayley-III" and feel free to incorporate in it any of the material in the essay above. Make sure to express your opinion regarding how the third edition of the test is or is not an improvement over the second edition. Also, update the discussion with regard to "upward score-drift" and voice your own opinion about what seems to be happening.

REFERENCES

Bayley, N. (1969). Bayley Scales of Infant Development: Birth to Two Years. New York: Psychological Corporation.
Bayley, N. (1993). Bayley Scales of Infant Development (2nd Edition) Manual. San Antonio, TX: Psychological Corporation.
Nellis, L., & Gridley, B. E. (1994). Review of the Bayley Scales of Infant Development—Second Edition. Journal of School Psychology, 32, 201–209.
Ruch, G. M. (1925). Minimum essentials in reporting data on standard tests. Journal of Educational Research, 12, 349–358.
Stanley, J. C. (1971). Reliability. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.). Washington, D.C.: American Council on Education.

THE 4-QUESTION CHALLENGE

1. A coefficient of reliability is
   a. a proportion that indicates the ratio between true score variance on a test and the total variance.
   b. a proportion that indicates the ratio between a partial universe score and the total universe.
   c. equal to the ratio between the variance and the standard deviation in a normal distribution.
   d. equal to the standard error of the difference between parallel forms of two criterion-referenced tests.

2. Test construction, test administration, and test scoring and interpretation are
   a. sources of error variance.
   b. the sole responsibility of a test publisher.
   c. "facets" according to true score theory.
   d. variables affected by inflation of range.

3. A measure of a test's internal consistency reliability could be obtained through the use of
   a. Kuder-Richardson formula 20.
   b. Cronbach's coefficient alpha.
   c. the Spearman-Brown formula.
   d. all of the above

4. In contrast to a power test, a speed test
   a. has a time limit designed to be long enough to allow all testtakers to attempt all items.
   b. can yield a split-half reliability estimate based on only one administration of the test.
   c. tends to yield score differences among testtakers that are based on performance speed.
   d. tends to yield spuriously inflated estimates of alternate forms reliability.