
ANNE ANASTASI
Professor of Psychology, Fordham University

Psychological Testing

MACMILLAN PUBLISHING CO., INC.
New York

Collier Macmillan Publishers
London
PREFACE

In a revised edition, one expects both similarities and differences. This edition shares with the earlier versions the objectives and basic approach of the book. The primary goal of this text is still to contribute toward the proper evaluation of psychological tests and the correct interpretation and use of test results. This goal calls for several kinds of information: (1) an understanding of the major principles of test construction, (2) psychological knowledge about the behavior being assessed, (3) sensitivity to the social and ethical implications of test use, and (4) broad familiarity with the types of available instruments and the sources of information about tests. A minor innovation in the fourth edition is the addition of a suggested outline for test evaluation (Appendix C).

In successive editions, it has been necessary to exercise more and more restraint to keep the number of specific tests discussed in the book from growing with the field; it has never been my intention to provide a miniature Mental Measurements Yearbook! Nevertheless, I am aware that principles of test construction and interpretation can be better understood when applied to particular tests. Moreover, acquaintance with the major types of available tests, together with an understanding of
their special contributions and limitations, is an essential component of knowledge about contemporary testing. For these reasons, specific tests are again examined and evaluated in Parts 3, 4, and 5. These tests have been chosen either because they are outstanding examples with which the student of testing should be familiar or because they illustrate some special point of test construction or interpretation. In the text itself, the principal focus is on types of tests rather than on specific instruments. At the same time, Appendix E contains a classified list of over 250 tests, including not only those cited in the text but also others added to provide a more representative sample.

As for the differences, they loomed especially large during the preparation of this edition. Much that has happened in human society since the mid-1960's has had an impact on psychological testing. Some of these developments were briefly described in the last two chapters of the third edition. Today they have become part of the mainstream of psychological testing and have been accordingly incorporated in the appropriate sections throughout the book. Recent changes in psychological testing that are reflected in the present edition can be described on three levels: (1) general orientation toward testing, (2) substantive and methodological developments, and (3) "ordinary progress" such as the publication of new tests and revision of earlier tests.
An example of changes on the first level is the increasing awareness of the ethical, social, and legal implications of testing. In the present edition, this topic has been expanded and treated in a separate chapter early in the book (Ch. 3) and in Appendixes A and B. A cluster of related developments represents a broadening of test uses. Besides the traditional applications of tests in selection and diagnosis, increasing attention is being given to administering tests for self-knowledge and self-development, and to training individuals in the use of their own test results in decision making (Chs. 3 and 4). In the same category are the continuing replacement of global scores with multitrait profiles and the application of classification strategies, whereby "everyone can be above average" in one or more socially valued variables (Ch. 7). From another angle, efforts are being made to modify traditional interpretations of test scores, in both cognitive and noncognitive areas, in the light of accumulating psychological knowledge. In this edition, Chapter 12 brings together psychological issues in the interpretation of intelligence test scores, touching on such problems as stability and change in intellectual level over time; the nature of intelligence; and the testing of intelligence in early childhood, in old age, and in different cultures. Another example is provided by the increasing emphasis on situational specificity and person-by-situation interactions in personality testing, stimulated in large part by the social-learning theorists (Ch. 17).

The second level, covering substantive and methodological changes, is illustrated by the impact of computers on the development, administration, scoring, and interpretation of tests (see especially Chs. 4, 11, 13, 17, 18, 19). The use of computers in administering or managing instructional programs has also stimulated the development of criterion-referenced tests, although other conditions have contributed to the upsurge of interest in such tests in education. Criterion-referenced tests are discussed principally in Chapters 4, 5, and 14. Other types of instruments that have come to prominence and have received fuller treatment in the present edition include: tests for identifying specific learning disabilities (Ch. 16), inventories and other devices for use in behavior modification programs (Ch. 20), instruments for assessing early childhood education (Ch. 14), Piagetian "ordinal" scales (Chs. 10 and 14), basic education and literacy tests for adults (Chs. 13 and 14), and techniques for the assessment of environments (Ch. 20). Problems to be considered in the assessment of minority groups, including the question of test bias, are examined from different angles in Chapters 3, 7, 8, and 12.

On the third level, it may be noted that over 100 of the tests listed in this edition have been either initially published or revised since the publication of the preceding edition (1968). Major examples include the McCarthy Scales of Children's Abilities, the WISC-R, the 1972 Stanford-Binet norms (with all the resulting readjustments in interpretations), Forms S and T of the DAT (including a computerized Career Planning Program), the Strong-Campbell Interest Inventory (merged form of the SVIB), and the latest revisions of the Stanford Achievement Test and the Metropolitan Readiness Tests.

It is a pleasure to acknowledge the assistance received from many sources in the preparation of this edition. The completion of the project was facilitated by a one-semester Faculty Fellowship awarded by Fordham University and by a grant from the Fordham University Research Council covering principally the services of a research assistant. These services were performed by Stanley Friedland with an unusual combination of expertise, responsibility, and graciousness. I am indebted to the many authors and test publishers who provided reprints, unpublished manuscripts, specimen sets of tests, and answers to my innumerable inquiries by mail and telephone. For assistance extending far beyond the interests and responsibilities of any single publisher, I am especially grateful to Anna Dragositz of Educational Testing Service and Blythe Mitchell of Harcourt Brace Jovanovich, Inc. I want to acknowledge the significant contribution of John T. Cowles of the University of Pittsburgh, who assumed complete responsibility for the preparation of the Instructor's Manual to accompany this text.

For informative discussions and critical comments on particular topics, I want to convey my sincere thanks to William H. Angoff of Educational Testing Service and to several members of the Fordham University Psychology Department, including David R. Chabot, Marvin Reznikoff, Reuben M. Schonebaum, and Warren W. Tryon. Grateful acknowledgment is also made of the thoughtful recommendations submitted by course instructors in response to the questionnaire distributed to current users of the third edition. Special thanks in this connection are due to Mary Carol Cahill for her extensive, constructive, and wide-ranging suggestions. I wish to express my appreciation to Victoria Overton of the Fordham University library staff for her efficient and courteous assistance in bibliographic matters. Finally, I am happy to record the contributions of my husband, John Porter Foley, Jr., who again participated in the solution of countless problems at all stages in the preparation of the book.

A.A.
CONTENTS

PART 1
CONTEXT OF PSYCHOLOGICAL TESTING

1. FUNCTIONS AND ORIGINS OF PSYCHOLOGICAL TESTING 3
Current uses of psychological tests
Early interest in classification and training of the mentally retarded 5
The first experimental psychologists 7
Contributions of Francis Galton 8
Cattell and the early "mental tests" 9
Binet and the rise of intelligence tests 10
Group testing 12
Aptitude testing 13
Standardized achievement tests 16
Measurement of personality 18
Sources of information about tests 20

2. NATURE AND USE OF PSYCHOLOGICAL TESTS
What is a psychological test? 23
Reasons for controlling the use of psychological tests
Test administration 32
Rapport 34
Test anxiety 37
Examiner and situational variables 39
Coaching, practice, and test sophistication 41

3. SOCIAL AND ETHICAL IMPLICATIONS OF TESTING
User qualifications 45
Testing instruments and procedures 47
Protection of privacy 49
Confidentiality 52
Communicating test results 56
Testing and the civil rights of minorities 57

PART 2
PRINCIPLES OF PSYCHOLOGICAL TESTING

4. NORMS AND THE INTERPRETATION OF TEST SCORES
Statistical concepts 68
Developmental norms 73
Within-group norms 77
Relativity of norms 88
Computer utilization in the interpretation of test scores 94
Criterion-referenced testing 96

5. RELIABILITY
The correlation coefficient 104
Types of reliability 110
Reliability of speeded tests 122
Dependence of reliability coefficients on the sample tested 125
Standard error of measurement 127
Reliability of criterion-referenced tests 131

6. VALIDITY: BASIC CONCEPTS
Content validity 134
Criterion-related validity 140
Construct validity 151
Overview 158

7. VALIDITY: MEASUREMENT AND INTERPRETATION
Validity coefficient and error of estimate 163
Test validity and decision theory 167
Moderator variables 177
Combining information from different tests 180
Use of tests for classification decisions 186
Statistical analyses of test bias 191

8. ITEM ANALYSIS
Item difficulty 199
Item validity 206
Internal consistency 215
Item analysis of speeded tests 217
Cross validation 219
Item-group interaction 222

PART 3
TESTS OF GENERAL INTELLECTUAL LEVEL

9. INDIVIDUAL TESTS
Stanford-Binet Intelligence Scale 230
Wechsler Adult Intelligence Scale 245
Wechsler Intelligence Scale for Children 255
Wechsler Preschool and Primary Scale of Intelligence 260

10. TESTS FOR SPECIAL POPULATIONS
Infant and preschool testing 266
Testing the physically handicapped 281
Cross-cultural testing 287

11. GROUP TESTING
Group tests versus individual tests 299
Multilevel batteries 305
Tests for the college level and beyond 318

12. PSYCHOLOGICAL ISSUES IN INTELLIGENCE TESTING
Longitudinal studies of intelligence 327
Intelligence in early childhood 332
Problems in the testing of adult intelligence 337
Problems in cross-cultural testing 343
Nature of intelligence 349

PART 4
TESTS OF SEPARATE ABILITIES

13. MEASURING MULTIPLE APTITUDES
Factor analysis 362
Theories of trait organization 369
Multiple aptitude batteries 378
Measurement of creativity 388

14. EDUCATIONAL TESTING
Achievement tests: their nature and uses 398
General achievement batteries 403
Standardized tests in separate subjects 410
Teacher-made classroom tests 412
Diagnostic and criterion-referenced tests 417
Specialized prognostic tests 423
Assessment in early childhood education 425

15. OCCUPATIONAL TESTING
Validation of industrial tests 435
Short screening tests for industrial personnel 439
Special aptitude tests 442
Testing in the professions 458

16. CLINICAL TESTING
Diagnostic use of intelligence tests 465
Special tests for detecting cognitive dysfunction
Identifying specific learning disabilities 478
Clinical judgment 482
Report writing 487

PART 5
PERSONALITY TESTS

17. SELF-REPORT INVENTORIES
Content validation 494
Empirical criterion keying 496
Factor analysis in test development 506
Personality theory in test development 510
Test-taking attitudes and response sets 515
Situational specificity 521
Evaluation of personality inventories

18. MEASURES OF INTERESTS, ATTITUDES, AND VALUES 527
Interest inventories 528
Opinion and attitude measurement 543
Attitude scales 546
Assessment of values and related variables 552

19. PROJECTIVE TECHNIQUES
Nature of projective techniques 558
Inkblot techniques 559
Thematic Apperception Test and related instruments
Other projective techniques 569
Evaluation of projective techniques 576

20. OTHER ASSESSMENT TECHNIQUES
"Objective" performance tests 588
Situational tests 593
Self-concepts and personal constructs 598
Assessment techniques in behavior modification programs
Observer reports 606
Biographical inventories 614
The assessment of environments 616

APPENDIXES
B. Guidelines on Employee Selection Procedures (EEOC)
Guidelines for Reporting Criterion-Related and Content Validity (OFCC)
PART 1
CONTEXT OF PSYCHOLOGICAL TESTING

CHAPTER 1
Functions and Origins of Psychological Testing

Anyone reading this book today could undoubtedly illustrate what is meant by a psychological test. It would be easy enough to recall a test the reader himself has taken in school, in college, in the armed services, in the counseling center, or in the personnel office. Or perhaps the reader has served as a subject in an experiment in which standardized tests were employed. This would certainly not have been the case fifty years ago. Psychological testing is a relatively young branch of one of the youngest of the sciences.

Basically, the function of psychological tests is to measure differences between individuals or between the reactions of the same individual on different occasions. One of the first problems that stimulated the development of psychological tests was the identification of the mentally retarded. To this day, the detection of intellectual deficiencies remains an important application of certain types of psychological tests. Related clinical uses of tests include the examination of the emotionally disturbed, the delinquent, and other types of behavioral deviants. A strong impetus to the early development of tests was likewise provided by problems arising in education. At present, schools are among the largest test users. The classification of children with reference to their ability to profit from different types of school instruction, the identification of the intellectually retarded on the one hand and the gifted on the other, the diagnosis of academic failures, the educational and vocational counseling of high school and college students, and the selection of applicants for professional and other special schools are among the many educational uses of tests.
The selection and classification of industrial personnel represent another major application of psychological testing. From the assembly-line operator or filing clerk to top management, there is scarcely a type of job for which some kind of psychological test has not proved helpful in such matters as hiring, job assignment, transfer, promotion, or termination. To be sure, the effective employment of tests in many of these situations, especially in connection with high-level jobs, usually requires that the tests be used as an adjunct to skilled interviewing, so that test scores may be properly interpreted in the light of other background information about the individual. Nevertheless, testing constitutes an important part of the total personnel program. A closely related application of psychological testing is to be found in the selection and classification of military personnel. From simple beginnings in World War I, the scope and variety of psychological tests employed in military situations underwent a phenomenal increase during World War II. Subsequently, research on test development has been continuing on a large scale in all branches of the armed services.

The use of tests in counseling has gradually broadened from a narrowly defined guidance regarding educational and vocational plans to an involvement with all aspects of the person's life. Emotional well-being and effective interpersonal relations have become increasingly prominent objectives of counseling. There is growing emphasis, too, on the use of tests to enhance self-understanding and personal development. Within this framework, test scores are part of the information given to the individual as aids to his own decision-making processes.

It is clearly evident that psychological tests are currently being employed in the solution of a wide range of practical problems. One should not, however, lose sight of the fact that such tests are also serving important functions in basic research. Nearly all problems in differential psychology, for example, require testing procedures as a means of gathering data. As illustrations, reference may be made to studies on the nature and extent of individual differences, the identification of psychological traits, the measurement of group differences, and the investigation of biological and cultural factors associated with behavioral differences. For all such areas of research, and for many others, the precise measurement of individual differences made possible by well-constructed tests is an essential prerequisite. Similarly, psychological tests provide standardized tools for investigating such varied problems as life-span developmental changes within the individual, the relative effectiveness of different educational procedures, the outcomes of psychotherapy, the impact of community programs, and the influence of noise on performance.

From the many different uses of psychological tests, it follows that some knowledge of such tests is needed for an adequate understanding of most fields of contemporary psychology. It is primarily with this end in view that the present book has been prepared. The book is not designed to make the individual either a skilled examiner and test administrator or an expert on test construction. It is directed, not to the test specialist, but to the general student of psychology. Some acquaintance with the leading current tests is necessary in order to understand references to the use of such tests in the psychological literature. And a proper evaluation and interpretation of test results must ultimately rest on a knowledge of how the tests were constructed, what they can be expected to accomplish, and what are their peculiar limitations. Today a familiarity with tests is required, not only by those who give or construct tests, but by the general psychologist as well.

A brief overview of the historical antecedents and origins of psychological testing will provide perspective and should aid in the understanding of present-day tests.¹ The direction in which contemporary psychological testing has been progressing can be clarified when considered in the light of the precursors of such tests. The special limitations as well as the advantages that characterize current tests likewise become more intelligible when viewed against the background in which they originated.

¹ A more detailed account of the early origins of psychological tests can be found in Goodenough (1949) and J. Peterson (1926). See also Boring (1950) and Murphy and Kovach (1972) for more general background, DuBois (1970) for a brief but comprehensive history of psychological testing, and Anastasi (1965) for historical antecedents of the study of individual differences.

The roots of testing are lost in antiquity. DuBois (1966) gives a provocative and entertaining account of the system of civil service examinations prevailing in the Chinese empire for some three thousand years. Among the ancient Greeks, testing was an established adjunct to the educational process. Tests were used to assess the mastery of physical as well as intellectual skills. The Socratic method of teaching, with its interweaving of testing and teaching, has much in common with today's programmed learning. From their beginnings in the middle ages, European universities relied on formal examinations in awarding degrees and honors. To identify the major developments that shaped contemporary testing, however, we need go no farther than the nineteenth century. It is to these developments that we now turn.
EARLY INTEREST IN CLASSIFICATION AND TRAINING OF THE MENTALLY RETARDED

The nineteenth century witnessed a strong awakening of interest in the humane treatment of the mentally retarded and the insane. Prior to that time, neglect, ridicule, and even torture had been the common lot of these unfortunates. With the growing concern for the proper care of mental deviates came a realization that some uniform criteria for identifying and classifying these cases were required. The establishment of many special institutions for the care of the mentally retarded in both Europe and America made the need for setting up admission standards and an objective system of classification especially urgent. First it was necessary to differentiate between the insane and the mentally retarded. The former manifested emotional disorders that might or might not be accompanied by intellectual deterioration from an initially normal level; the latter were characterized essentially by intellectual defect that had been present from birth or early infancy. What is probably the first explicit statement of this distinction is to be found in a two-volume work published in 1838 by the French physician Esquirol (1838), in which over one hundred pages are devoted to mental retardation. Esquirol also pointed out that there are many degrees of mental retardation, varying along a continuum from normality to low-grade idiocy. In the effort to develop some system for classifying the different degrees and varieties of retardation, Esquirol tried several procedures but concluded that the individual's use of language provides the most dependable criterion of his intellectual level. It is interesting to note that current criteria of mental retardation are also largely linguistic and that present-day intelligence tests are heavily loaded with verbal content. The important part verbal ability plays in our concept of intelligence will be repeatedly demonstrated in subsequent chapters.

Of special significance are the contributions of another French physician, Seguin, who pioneered in the training of the mentally retarded. Having rejected the prevalent notion of the incurability of mental retardation, Seguin (1866) experimented for many years with what he termed the physiological method of training; and in 1837 he established the first school devoted to the education of mentally retarded children. In 1848 he emigrated to America, where his ideas gained wide recognition. Many of the sense-training and muscle-training techniques currently in use in institutions for the mentally retarded were originated by Seguin. By these methods, severely retarded children are given intensive exercise in sensory discrimination and in the development of motor control. Some of the procedures developed by Seguin for this purpose were eventually incorporated into performance or nonverbal tests of intelligence. An example is the Seguin Form Board, in which the individual is required to insert variously shaped blocks into the corresponding recesses as quickly as possible.

More than half a century after the work of Esquirol and Seguin, the French psychologist Alfred Binet urged that children who failed to respond to normal schooling be examined before dismissal and, if considered educable, be assigned to special classes (T. H. Wolf, 1973). With his fellow members of the Society for the Psychological Study of the Child, Binet stimulated the Ministry of Public Instruction to take steps to improve the condition of retarded children. A specific outcome was the establishment of a ministerial commission for the study of retarded children, to which Binet was appointed. This appointment was a momentous event in the history of psychological testing, of which more will be said later.

THE FIRST EXPERIMENTAL PSYCHOLOGISTS

The early experimental psychologists of the nineteenth century were not, in general, concerned with the measurement of individual differences. The principal aim of psychologists of that period was the formulation of generalized descriptions of human behavior. It was the uniformities rather than the differences in behavior that were the focus of attention. Individual differences were either ignored or were accepted as a necessary evil that limited the applicability of the generalizations. Thus, the fact that one individual reacted differently from another when observed under identical conditions was regarded as a form of error. The presence of such error, or individual variability, rendered the generalizations approximate rather than exact. This was the attitude toward individual differences that prevailed in such laboratories as that founded by Wundt at Leipzig in 1879, where many of the early experimental psychologists received their training.

In their choice of topics, as in many other phases of their work, the founders of experimental psychology reflected the influence of their backgrounds in physiology and physics. The problems studied in their laboratories were concerned largely with sensitivity to visual, auditory, and other sensory stimuli and with simple reaction time. This emphasis on sensory phenomena was in turn reflected in the nature of the first psychological tests, as will be apparent in subsequent sections.

Still another way in which nineteenth-century experimental psychology influenced the course of the testing movement may be noted. The early psychological experiments brought out the need for rigorous control of the conditions under which observations were made. For example, the wording of directions given to the subject in a reaction-time experiment might appreciably increase or decrease the speed of the subject's response. Or again, the brightness or color of the surrounding field could markedly alter the appearance of a visual stimulus. The importance of making observations on all subjects under standardized conditions was thus vividly demonstrated. Such standardization of procedure eventually became one of the special earmarks of psychological tests.
CONTRIBUTIONS OF FRANCIS GALTON

It was the English biologist Sir Francis Galton who was primarily responsible for launching the testing movement. A unifying factor in Galton's numerous and varied research activities was his interest in human heredity. In the course of his investigations on heredity, Galton realized the need for measuring the characteristics of related and unrelated persons. Only in this way could he discover, for example, the exact degree of resemblance between parents and offspring, brothers and sisters, cousins, or twins. With this end in view, Galton was instrumental in inducing a number of educational institutions to keep systematic anthropometric records on their students. He also set up an anthropometric laboratory at the International Exposition of 1884 where, by paying threepence, visitors could be measured in certain physical traits and could take tests of keenness of vision and hearing, muscular strength, reaction time, and other simple sensorimotor functions. When the exposition closed, the laboratory was transferred to South Kensington Museum, London, where it operated for six years. By such methods, the first large, systematic body of data on individual differences in simple psychological processes was gradually accumulated.

Galton himself devised most of the simple tests administered at his anthropometric laboratory, many of which are still familiar either in their original or in modified forms. Examples include the Galton bar for visual discrimination of length, the Galton whistle for determining the highest audible pitch, and graduated series of weights for measuring kinesthetic discrimination. It was Galton's belief that tests of sensory discrimination could serve as a means of gauging a person's intellect. In this respect, he was partly influenced by the theories of Locke. Thus Galton wrote: "The only information that reaches us concerning outward events appears to pass through the avenue of our senses; and the more perceptive the senses are of difference, the larger is the field upon which our judgment and intelligence can act" (Galton, 1883, p. 27). Galton had also noted that idiots tend to be defective in the ability to discriminate heat, cold, and pain, an observation that further strengthened his conviction that sensory discriminative capacity "would on the whole be highest among the intellectually ablest" (Galton, 1883, p. 29).

Galton also pioneered in the application of rating-scale and questionnaire methods, as well as in the use of the free association technique subsequently employed for a wide variety of purposes. A further contribution of Galton is to be found in his development of statistical methods for the analysis of data on individual differences. Galton selected and adapted a number of techniques previously derived by mathematicians. These techniques he put in such form as to permit their use by the mathematically untrained investigator who might wish to treat test results quantitatively. He thereby extended enormously the application of statistical procedures to the analysis of test data. This phase of Galton's work has been carried forward by many of his students, the most eminent of whom was Karl Pearson.

CATTELL AND THE EARLY "MENTAL TESTS"

An especially prominent position in the development of psychological testing is occupied by the American psychologist James McKeen Cattell. The newly established science of experimental psychology and the still newer testing movement merged in Cattell's work. For his doctorate at Leipzig, he completed a dissertation on individual differences in reaction time, despite Wundt's resistance to this type of investigation. While lecturing at Cambridge in 1888, Cattell's own interest in the measurement of individual differences was reinforced by contact with Galton. On his return to America, Cattell was active both in the establishment of laboratories for experimental psychology and in the spread of the testing movement.

In an article written by Cattell in 1890, the term "mental test" was used for the first time in the psychological literature. This article described a series of tests that were being administered annually to college students in the effort to determine their intellectual level. The tests, which had to be administered individually, included measures of muscular strength, speed of movement, sensitivity to pain, keenness of vision and of hearing, weight discrimination, reaction time, memory, and the like. In his choice of tests, Cattell shared Galton's view that a measure of intellectual functions could be obtained through tests of sensory discrimination and reaction time. Cattell's preference for such tests was also bolstered by the fact that simple functions could be measured with precision and accuracy, whereas the development of objective measures for the more complex functions seemed at that time a well-nigh hopeless task.

Cattell's tests were typical of those to be found in a number of test series developed during the last decade of the nineteenth century. Such test series were administered to schoolchildren, college students, and miscellaneous adults. At the Columbian Exposition held in Chicago in 1893, Jastrow set up an exhibit at which visitors were invited to take tests of sensory, motor, and simple perceptual processes and to compare their skill with the norms (J. Peterson, 1926; Philippe, 1894). A few attempts to evaluate such early tests yielded very discouraging results: The individual's performance showed little correspondence from one test to another (Sharp, 1898-1899; Wissler, 1901), and it exhibited little or no
relation to independent estimates of intellectual level based on teachers' ratings (Bolton, 1891-1892; J. A. Gilbert, 1894) or academic grades (Wissler, 1901).

A number of test series assembled by European psychologists of the period tended to cover somewhat more complex functions. Kraepelin (1895), who was interested primarily in the clinical examination of psychiatric patients, prepared a long series of tests to measure what he regarded as basic factors in the characterization of an individual. The tests, employing chiefly simple arithmetic operations, were designed to measure practice effects, memory, and susceptibility to fatigue and to distraction. A few years earlier, Oehrn (1889), a pupil of Kraepelin, had employed tests of perception, memory, association, and motor functions in an investigation on the interrelations of psychological functions. Another German psychologist, Ebbinghaus (1897), administered tests of arithmetic computation, memory span, and sentence completion to schoolchildren. The most complex of the three tests, sentence completion, was the only one that showed a clear correspondence with the children's scholastic achievement.

Like Kraepelin, the Italian psychologist Ferrari and his students were interested primarily in the use of tests with pathological cases (Guicciardi & Ferrari, 1896). The test series they devised ranged from physiological measures and motor tests to apprehension span and the interpretation of pictures. In an article published in France in 1895, Binet and Henri criticized most of the available test series as being too largely sensory and as concentrating unduly on simple, specialized abilities. They argued further that, in the measurement of the more complex functions, great precision is not necessary, since individual differences are larger in these functions. An extensive and varied list of tests was proposed, covering such functions as memory, imagination, attention, comprehension, suggestibility, aesthetic appreciation, and many others. In these tests we can recognize the trends that were eventually to lead to the development of the famous Binet intelligence scales.

BINET AND THE RISE OF INTELLIGENCE TESTS

Binet and his co-workers devoted many years to active and ingenious research on ways of measuring intelligence. Many approaches were tried, including even the measurement of cranial, facial, and hand form, and the analysis of handwriting. The results, however, led to a growing conviction that the direct, even though crude, measurement of complex intellectual functions offered the greatest promise. Then a specific situation arose that brought Binet's efforts to immediate practical fruition. In 1904, the Minister of Public Instruction appointed Binet to the previously cited commission to study procedures for the education of retarded children. It was in connection with the objectives of this commission that Binet, in collaboration with Simon, prepared the first Binet-Simon Scale (Binet & Simon, 1905).

This scale, known as the 1905 scale, consisted of 30 problems or tests arranged in ascending order of difficulty. The difficulty level was determined empirically by administering the tests to 50 normal children aged 3 to 11 years, and to some mentally retarded children and adults. The tests were designed to cover a wide variety of functions, with special emphasis on judgment, comprehension, and reasoning, which Binet regarded as essential components of intelligence. Although sensory and perceptual tests were included, a much greater proportion of verbal content was found in this scale than in most test series of the time. The 1905 scale was presented as a preliminary and tentative instrument, and no precise objective method for arriving at a total score was formulated.

In the second, or 1908, scale, the number of tests was increased, some unsatisfactory tests from the earlier scale were eliminated, and all tests were grouped into age levels on the basis of the performance of about 300 normal children between the ages of 3 and 13 years. Thus, in the 3-year level were placed all tests passed by 80 to 90 percent of normal 3-year-olds; in the 4-year level, all tests similarly passed by normal 4-year-olds; and so on to age 13. The child's score on the entire test could then be expressed as a mental level corresponding to the age of normal children whose performance he equaled. In the various translations and adaptations of the Binet scales, the term "mental age" was commonly substituted for "mental level." Since mental age is such a simple concept to grasp, the introduction of this term undoubtedly did much to popularize intelligence testing.² Binet himself, however, avoided the term "mental age" because of its unverified developmental implications and preferred the more neutral term "mental level" (T. H. Wolf, 1973).

² Goodenough (1949, pp. 50-51) notes that in 1887, 21 years before the appearance of the 1908 Binet-Simon Scale, S. E. Chaille published in the New Orleans Medical and Surgical Journal a series of tests for infants arranged according to the age at which the tests are commonly passed. Partly because of the limited circulation of the journal and partly, perhaps, because the scientific community was not ready for it, the significance of this age-scale concept passed unnoticed at the time. Binet's own scale was influenced by the work of some of his contemporaries, notably Blin and Damaye, who prepared a set of oral questions from which they derived a single global score for each child (T. H. Wolf, 1973).

A third revision of the Binet-Simon Scale appeared in 1911, the year of Binet's untimely death. In this scale, no fundamental changes were introduced. Minor revisions and relocations of specific tests were instituted. More tests were added at several year levels, and the scale was extended to the adult level.

Even prior to the 1908 revision, the Binet-Simon tests attracted wide
attention among psychologists throughout the world. Translations and adaptations appeared in many languages. In America, a number of different revisions were prepared, the most famous of which is the one developed under the direction of L. M. Terman at Stanford University, and known as the Stanford-Binet (Terman, 1916). It was in this test that the intelligence quotient (IQ), or ratio between mental age and chronological age, was first used. The latest revision of this test is widely employed today and will be more fully considered in Chapter 9. Of special interest, too, is the first Kuhlmann-Binet revision, which extended the scale downward to the age level of 3 months (Kuhlmann, 1912). This scale represents one of the earliest efforts to develop preschool and infant tests of intelligence.
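The ratio IQ described in the preceding paragraph is easily written out. The formula below, with the conventional multiplier of 100 that removes the decimal point, is a modern gloss added here for illustration rather than a quotation from the original text:

$$\mathrm{IQ} = \frac{\mathrm{MA}}{\mathrm{CA}} \times 100$$

where MA is the mental age and CA the chronological age. Thus a 6-year-old who performs like the average 8-year-old obtains an IQ of (8/6) x 100, or about 133, while a child whose mental and chronological ages coincide obtains exactly 100.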
GROUP TESTING

The Binet tests, as well as all their revisions, are individual scales in the sense that they can be administered to only one person at a time. Many of the tests in these scales require oral responses from the subject or necessitate the manipulation of materials. Some call for individual timing of responses. For these and other reasons, such tests are not adapted to group administration. Another characteristic of the Binet type of test is that it requires a highly trained examiner. Such tests are essentially clinical instruments, suited to the intensive study of individual cases.

Group testing, like the first Binet scale, was developed to meet a pressing practical need. When the United States entered World War I in 1917, a committee was appointed by the American Psychological Association to consider ways in which psychology might assist in the conduct of the war. This committee, under the direction of Robert M. Yerkes, recognized the need for the rapid classification of the million and a half recruits with respect to general intellectual level. Such information was relevant to many administrative decisions, including rejection or discharge from military service, assignment to different types of service, or admission to officer-training camps. It was in this setting that the first group intelligence test was developed. In this task, the Army psychologists drew on all available test materials, and especially on an unpublished group intelligence test prepared by Arthur S. Otis, which he turned over to the Army. A major contribution of Otis's test, which he designed while a student in one of Terman's graduate courses, was the introduction of multiple-choice and other "objective" item types.

The tests finally developed by the Army psychologists came to be known as the Army Alpha and the Army Beta. The former was designed for general routine testing; the latter was a nonlanguage scale employed with illiterates and with foreign-born recruits who were unable to take a test in English. Both tests were suitable for administration to large groups.

Shortly after the termination of World War I, the Army tests were released for civilian use. Not only did the Army Alpha and Army Beta themselves pass through many revisions, the latest of which are even now in use, but they also served as models for most group intelligence tests. The testing movement underwent a tremendous spurt of growth. Soon group intelligence tests were being devised for all ages and types of persons, from preschool children to graduate students. Large-scale testing programs, previously impossible, were now being launched with zestful optimism. Because group tests were designed as mass testing instruments, they not only permitted the simultaneous examination of large groups but also simplified the instructions and administration procedures so as to demand a minimum of training on the part of the examiner. Schoolteachers began to give intelligence tests to their classes. College students were routinely examined prior to admission. Extensive studies of special adult groups, such as prisoners, were undertaken. And soon the general public became IQ-conscious.

The application of such group intelligence tests far outran their technical improvement. That the tests were still crude instruments was often forgotten in the rush of gathering scores and drawing practical conclusions from the results. When the tests failed to meet unwarranted expectations, skepticism and hostility toward all testing often resulted. Thus, the testing boom of the twenties, based on the indiscriminate use of tests, may have done as much to retard as to advance the progress of psychological testing.

APTITUDE TESTING

Although intelligence tests were originally designed to sample a wide variety of functions in order to estimate the individual's general intellectual level, it soon became apparent that such tests were quite limited in their coverage. Not all important functions were represented. In fact, most intelligence tests were primarily measures of verbal ability and, to a lesser extent, of the ability to handle numerical and other abstract and symbolic relations. Gradually psychologists came to recognize that the term "intelligence test" was a misnomer, since only certain aspects of intelligence were measured by such tests.
To be sure, the tests covered abilities that are of prime importance in our culture. But it was realized that more precise designations, in terms of the type of information these tests are able to yield, would be preferable. For example, a number of tests that would probably have been called intelligence tests during the twenties later came to be known as scholastic aptitude tests. This shift in terminology was made in recognition of the fact that many so-called intelligence tests measure that combination of abilities demanded by academic work.

Even prior to World War I, psychologists had begun to recognize the need for tests of special aptitudes to supplement the global intelligence tests. These special aptitude tests were developed particularly for use in vocational counseling and in the selection and classification of industrial and military personnel. Among the most widely used are tests of mechanical, clerical, musical, and artistic aptitudes.

The practical application of intelligence tests that followed their widespread and indiscriminate use during the twenties also revealed another noteworthy fact: an individual's performance on different parts of such a test often showed marked variation. This was especially apparent on group tests, in which the items are commonly segregated into subtests of relatively homogeneous content. For example, a person might score relatively high on a verbal subtest and low on a numerical subtest, or vice versa. To some extent, such internal variability is also discernible on a test like the Stanford-Binet, in which, for example, all items involving words might prove difficult for a particular individual, whereas items employing pictures or geometric diagrams may place him at an advantage.

Test users, and especially clinicians, frequently utilized such intercomparisons in order to obtain more insight into the individual's psychological make-up. Thus, not only the IQ or other global score but also scores on subtests would be examined in the evaluation of the individual case. Such a practice is not to be generally recommended, however, because intelligence tests were not designed for the purpose of differential aptitude analysis. Often the subtests being compared contain too few items to yield a stable or reliable estimate of a specific ability. As a result, the obtained difference between subtest scores might be reversed if the individual were retested on a different day or with another form of the same test. If such intraindividual comparisons are to be made, tests are needed that are specially designed to reveal differences in performance in various functions.

While the practical application of tests demonstrated the need for differential aptitude tests, a parallel development in the study of trait organization was gradually providing the means for constructing such tests. Statistical studies on the nature of intelligence had been exploring the interrelations among scores obtained by many persons on a wide variety of different tests. Such investigations were begun by the English psychologist Charles Spearman (1904, 1927) during the first decade of the present century. Subsequent methodological developments, based on the work of such American psychologists as T. L. Kelley (1928) and L. L. Thurstone (1935, 1947), as well as on that of other American and English investigators, have come to be known as "factor analysis."

The contributions that the methods of factor analysis have made to test construction will be more fully examined and illustrated in Chapter 13. For the present, it will suffice to note that the data gathered by such procedures have indicated the presence of a number of relatively independent factors, or traits. Some of these traits were represented, in varying proportions, in the traditional intelligence tests. Verbal comprehension and numerical reasoning are examples of this type of trait. Others, such as spatial, perceptual, and mechanical aptitudes, were found more often in special aptitude tests than in intelligence tests.

One of the chief practical outcomes of factor analysis was the development of multiple aptitude batteries. These batteries are designed to provide a measure of the individual's standing in each of a number of traits. In place of a total score or IQ, a separate score is obtained for such traits as verbal comprehension, numerical aptitude, spatial visualization, arithmetic reasoning, and perceptual speed. Such batteries thus provide a suitable instrument for making the kind of intraindividual analysis, or differential diagnosis, that clinicians had been trying for many years to obtain, with crude and often erroneous results, from intelligence tests. These batteries also incorporate into a comprehensive and systematic testing program much of the information formerly obtained from special aptitude tests, since the multiple aptitude batteries cover some of the traits not ordinarily included in intelligence tests.

Multiple aptitude batteries represent a relatively late development in the testing field. Nearly all have appeared since 1945. In this connection, the work of the military psychologists during World War II should also be noted. Much of the test research conducted in the armed services was based on factor analysis and was directed toward the construction of multiple aptitude batteries. In the Air Force, for example, special batteries were constructed for pilots, bombardiers, radio operators, range finders, and scores of other military specialists. A report of the batteries prepared in the Air Force alone occupies at least nine of the nineteen volumes devoted to the aviation psychology program during World War II (Army Air Forces, 1947-1948). Research along these lines is still in progress under the sponsorship of various branches of the armed services. A number of multiple aptitude batteries have likewise been developed for civilian use and are being widely applied in educational and vocational counseling and in personnel selection and classification. Examples of such batteries will be discussed in Chapter 13.
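As a concrete illustration of the reasoning sketched above, the short Python program below (an editorial addition, not part of the original text) builds hypothetical scores on six invented subtests, three saturated with a "verbal" factor and three with a "spatial" one, and shows that the intercorrelations fall into two clusters summarized by two dominant eigenvalues. Modern matrix notation is used purely for convenience; it is not the centroid arithmetic that Spearman or Thurstone actually employed, and all subtest names and figures here are made up.

```python
# A minimal sketch of the idea behind factor analysis, using only NumPy.
# The data are hypothetical: six invented subtests for 200 examinees.
import numpy as np

rng = np.random.default_rng(0)
n = 200
verbal = rng.normal(size=n)    # latent "verbal" ability (hypothetical)
spatial = rng.normal(size=n)   # latent "spatial" ability (hypothetical)

# Each observed subtest = its latent factor plus independent error.
scores = np.column_stack([
    verbal + rng.normal(scale=0.5, size=n),    # vocabulary
    verbal + rng.normal(scale=0.5, size=n),    # analogies
    verbal + rng.normal(scale=0.5, size=n),    # sentence completion
    spatial + rng.normal(scale=0.5, size=n),   # form board
    spatial + rng.normal(scale=0.5, size=n),   # block counting
    spatial + rng.normal(scale=0.5, size=n),   # paper folding
])

# Intercorrelations among the six subtests: the verbal trio correlate
# highly with one another, the spatial trio likewise, and the two
# clusters correlate near zero -- the pattern that suggests two
# relatively independent factors.
r = np.corrcoef(scores, rowvar=False)
print(np.round(r, 2))

# The two largest eigenvalues of the correlation matrix account for
# most of the common variance, a rough eigenvalue summary of the
# "number of factors" question.
eigenvalues = np.linalg.eigvalsh(r)[::-1]
print(np.round(eigenvalues, 2))
```

With the error variance chosen here, two subtests sharing a factor correlate about .80 with each other and near zero across clusters; factor analysts summarize such a matrix with two independent traits rather than a single global score.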
term "aptitude test" has been tracHtiollalJ" cmployed to refer to tests and other hroad educational objectives. The deeade of the 19:305 also
measuring relativel\" homo ('ncous and dparlv defined sc rn1C'nts of witnessed the introduction of test-seoring maehines, for which the new
• I I \., t le term "intelliO'ence
Co) e-. test" customarih' . refers to more hderogenc- ohjec:tive tests could be readily adapted.
~ests yielding a single global score sm:h as an IQ. S~)ecial aptitu~c The establishment of statewide, regional. and nalional testing programs
tests typically measure a single aptitude. ~lultiple al~tltl1de battenes ,,,as another noteworthy parallel denlopment. Probably the best known
measure a number of aptitudes but pro\"ide a profile of scores, one for .?f these programs is that of the College Entrance Examination Board
eaeh aptitude. ~t;EEB). Established at thc turn of the ce_ll'~' to reduce duplication in
the exa"tnining of entering college freshmen, this program has undergone
profound changes ill its testing procedures and in the number and nature
?f partie-ipa.ting col1eges-c·hangcs that reflect inten'ening developments
111both testIng and cducation. In 1947, the testing functions of the CEEB
While psychologists were busy developing intelligence and aptitude were llIerged with those of the Carnegie Corporation and the American
tests, traditional school examinations were undergoing a number of tech- Council on Education to form Educational Testing Service (ETS). In
nical improvements (Caldwell & Courtis, 192:3; Ebel & Damrin, 1960 ~. subscq.t1cnt ~'ears, ETS has assumed responsibility for a growing number
An important step in this direction was taken by the Boston pubhc of testlllg programs on behalf of universities, professional schools, gov-
schools in 1845, when written examinations wefe substituted for the oral ernment agencies, and other institutions. \[ention should also he made of
interroO'ation of students by visiting examiners. Commenting on this in- the American Collegc Testing Program established in 1959 to scrccn
nDvati~l, Horacc ~fann cit~d arguments remarkably similar to those used applicants to colleges not included i~ thc CEEB program, and of several
much later to justify the replacement of essay questions hy objective national testing programs for the selection of highl\' talented students
multiple-choice items. The written examiuations, \lann noted, put all for scholarship awards. .
students in a uniform situation, permitted a wider cO\'erage of content, . Achievem.ent tests are used not only for educational purposes but also
reduced the chance element in question choice, and eliminated tIll' pos- III the se]Pchon of applicants for industrial and government jobs. \fention
sibilitv of h\'oritism on the examiner's part. has already been made of the systematic use of ci\'i\ sen'jce examinations
Aft~r the turn of the centurv, the first stand-ardized tests for measuring in the Chinese empire, dating from 111.5 .B.c. In modern times, selection
the outeomes of school instnl~tion began to appear. Spearheaded h~' the of go\'~rnI~lent emplo:-e~s by examination was introduced in European
work of E. L. Thorndike. these tests utilized measurement principks de- countnes 111the late eIghteenth and eark nineteenth centuries. The
veloped in the psychological laboratory. Examples include scales for l!llited States Chi! Service Commission in~talled competitive examina-
rating the quality of handwriting and written compos.itiol1s, as. well ~s tions as a regular procedure in 1883 (Kanuck, 19.56). Test construction
tests in spelling, arithmetic computation, and arithmetic reasol1lng. Stl11 techniques developed during and prior to World "'a~ I were introduded
later came the achie\"ement batteries, initiated by the publication of the into tll<:'examination program of the United States Ch-il Service with the
first edition of the Stanford Achievement Test in 192:3. Its authors were appointment of L. J. O'Rourke as director of the newlv established re-
three earl" It'aders in test development: Truman L. Kelley, GHes ~f. search dh'ision in 1922. '
Ruch, ami Lewis M. Terman. Foreshadowing many characteri·stic'S of . As more and more psychologists trained in psychometrics participated
modern t'fsting, this battery provided com~arable measu~'es of perfo~- m the construction of standardized achievement tests, the technical as-
ance in different school subjects, evaluated 111 terms of a smgle norma live pects of achievement tests increasingly came to resemble those of in-
group. telligence and aptitude tests. Procedur~s for cons,trllcting and evaluating
At the same time, evidence was accumulating regarding the lack of all ~hese tcsts have much in common. The incre~s!ng effOlts to prepare
agreement among teachers in grading essay tests. By .1930 it was.widely achIevement tests that would measure the attainment of broad educa-
recognized that essay tests were not only more hme-cOnsumll1g for tional goals, as contrasted to the recall of factualiminutiae also made
examiners and examinees, but also yielded less reliable results than the the content of achievement tests resemble more -cioselv th~t of intelli-
"new type" of objective items. As the latter came into increasing use in ge~lce tests. Today the difference between these two 'types of tests is
standardized achievement tests, there was a growing emphaSiS on the dueHy one of degree of specificity of content and extent to which the
design of items to test the understanding and application of knowledge test presupposes a designated course of prior instruCtion.
Another area of psychological testing is concerned with the affective or nonintellectual aspects of behavior. Tests designed for this purpose are commonly known as personality tests, although some psychologists prefer to use the term personality in a broader sense, to refer to the entire individual. Intellectual as well as nonintellectual traits would thus be included under this heading. In the terminology of psychological testing, however, the designation "personality test" most often refers to measures of such characteristics as emotional adjustment, interpersonal relations, motivation, interests, and attitudes.

An early precursor of personality testing may be recognized in Kraepelin's use of the free association test with abnormal patients. In this test the subject is given specially selected stimulus words and is required to respond to each with the first word that comes to mind. Kraepelin (1892) also employed this technique to study the psychological effects of fatigue, hunger, and drugs and concluded that all these agents increase the relative frequency of superficial associations. Sommer (1894), also writing during the last decade of the nineteenth century, suggested that the free association test might be used to differentiate between the various forms of mental disorder. The free association technique has subsequently been utilized for a variety of testing purposes and is still currently employed. Mention should also be made of the work of Galton, Pearson, and Cattell in the development of standardized questionnaire and rating-scale techniques. Although originally devised for other purposes, these procedures were eventually employed by others in constructing some of the most common types of current personality tests.

The prototype of the personality questionnaire, or self-report inventory, is the Personal Data Sheet developed by Woodworth during World War I (DuBois, 1970; Symonds, 1931, Ch. 5; Goldberg, 1971). This test was designed as a rough screening device for identifying seriously neurotic men who would be unfit for military service. The inventory consisted of a number of questions dealing with common neurotic symptoms, which the individual answered about himself. A total score was obtained by counting the number of symptoms reported. The Personal Data Sheet was not completed early enough to permit its operational use before the war ended. Immediately after the war, however, civilian forms were prepared, including a special form for use with children. The Woodworth Personal Data Sheet, moreover, served as a model for most subsequent emotional adjustment inventories. In some of these questionnaires, an attempt was made to subdivide emotional adjustment into more specific forms, such as home adjustment, school adjustment, and vocational adjustment. Other tests concentrated more intensively on a narrower area of behavior or were concerned with more distinctly social responses, such as dominance-submission in interpersonal contacts. A later development was the construction of tests for quantifying the expression of interests and attitudes. These tests, too, were based essentially on questionnaire techniques.

Another approach to the measurement of personality is through the application of performance or situational tests. In such tests, the subject has a task to perform whose purpose is often disguised. Most of these tests simulate everyday-life situations quite closely. The first extensive application of such techniques is to be found in the tests developed in the late twenties and early thirties by Hartshorne, May, and their associates (1928, 1929, 1930). This series, standardized on schoolchildren, was concerned with such behavior as cheating, lying, stealing, cooperativeness, and persistence. Objective, quantitative scores could be obtained on each of a large number of specific tests. A more recent illustration, for the adult level, is provided by the series of situational tests developed during World War II in the Assessment Program of the Office of Strategic Services (OSS, 1948). These tests were concerned with relatively complex and subtle social and emotional behavior and required rather elaborate facilities and trained personnel for their administration. The interpretation of the subject's responses, moreover, was relatively subjective.

Projective techniques represent a third approach to the study of personality and one that has shown phenomenal growth, especially among clinicians. In such tests, the subject is given a relatively unstructured task that permits wide latitude in its solution. The assumption underlying such methods is that the individual will project his characteristic modes of response into such a task. Like the performance and situational tests, projective techniques are more or less disguised in their purpose, thereby reducing the chances that the subject can deliberately create a desired impression. The previously cited free association test represents one of the earliest types of projective techniques. Sentence-completion tests have also been used in this manner. Other tasks commonly employed in projective techniques include drawing, arranging toys to create a scene, extemporaneous dramatic play, and interpreting pictures or inkblots.

All available types of personality tests present serious difficulties, both practical and theoretical. Each approach has its own special advantages and disadvantages. On the whole, personality testing has lagged far behind aptitude testing in its positive accomplishments. But such lack of progress is not to be attributed to insufficient effort. Research on the measurement of personality has attained impressive proportions, and many ingenious devices and technical improvements are under investigation. It is rather the special difficulties encountered in the measurement of personality that account for the slow advances in this area.
Psychological testing is in a state of rapid change. There are shifting orientations, a constant stream of new tests, revised forms of old tests, and additional data that may refine or alter the interpretation of scores on existing tests. The accelerating rate of change, together with the vast number of available tests, makes it impracticable to survey specific tests in any single text. More intensive coverage of testing instruments and problems in special areas can be found in books dealing with the use of tests in such fields as counseling, clinical practice, personnel selection, and education. References to such publications are given in the appropriate chapters of this book. In order to keep abreast of current developments, however, anyone working with tests needs to be familiar with more direct sources of contemporary information about tests.

One of the most important sources is the series of Mental Measurements Yearbooks (MMY) edited by Buros (1972). These yearbooks cover nearly all commercially available psychological, educational, and vocational tests published in English. The coverage is especially complete for paper-and-pencil tests. Each yearbook includes tests published during a specified period, thus supplementing rather than supplanting the earlier yearbooks. The Seventh Mental Measurements Yearbook, for example, is concerned principally with tests appearing between 1964 and 1970. Tests of continuing interest, however, may be reviewed repeatedly in successive yearbooks, as new data accumulate from pertinent research. The earliest publications in this series were merely bibliographies of tests. Beginning in 1938, however, the yearbook assumed its current form, which includes critical reviews of most of the tests by one or more test experts, as well as a complete list of published references pertaining to each test. Routine information regarding publisher, price, forms, and age of subjects for whom the test is suitable is also regularly given.

A comprehensive bibliography covering all types of published tests available in English-speaking countries is provided by Tests in Print (Buros, 1974). Two related sources are Reading Tests and Reviews (Buros, 1968) and Personality Tests and Reviews (Buros, 1970). Both include a number of tests not found in any volume of the MMY, as well as master indexes that facilitate the location of tests in the MMY. Reviews of specific tests are also published in several psychological and educational journals, such as the Journal of Educational Measurement and the Journal of Counseling Psychology.

Since 1970 several sourcebooks have appeared which provide information about unpublished or little-known instruments, largely supplementing the material listed in the MMY. A comprehensive survey of such instruments can be found in A Sourcebook for Mental Health Measures (Comrey, Backer, & Glaser, 1973). Containing approximately 1,100 abstracts, this sourcebook includes tests, questionnaires, rating scales, and other devices for assessing both aptitude and personality variables in adults and children. Another similar reference is entitled Measures for Psychological Assessment (Chun, Cobb, & French, 1975). For each of 3,000 measures, this volume gives the original source as well as an annotated bibliography of the studies in which the measure was subsequently used. The entries were located through a search of 26 measurement-related journals for the years 1960 to 1970.

Information on assessment devices suitable for children from birth to 12 years is summarized in Tests and Measurements in Child Development: A Handbook (Johnson & Bommarito, 1971). Covering only tests not listed in the MMY, this handbook describes instruments located through an intensive journal search spanning a ten-year period. Selection criteria included availability of the test to professionals, adequate instructions for administration and scoring, sufficient length, and convenience of use (i.e., not requiring expensive or elaborate equipment). A still more specialized collection covers measures of social and emotional development applicable to children between the ages of 3 and 6 years (Walker, 1973).

Finally, it should be noted that the most direct source of information regarding specific current tests is provided by the catalogues of test publishers and by the manual that accompanies each test. A comprehensive list of test publishers, with addresses, can be found in the latest Mental Measurements Yearbook. For ready reference, the names and addresses of some of the larger American publishers and distributors of psychological tests are given in Appendix D. Catalogues of current tests can be obtained from each of these publishers on request. Manuals and specimen sets of tests can be purchased by qualified users.

The test manual should provide the essential information required for administering, scoring, and evaluating a particular test. In it should be found full and detailed instructions, scoring key, norms, and data on reliability and validity. Moreover, the manual should report the number and nature of subjects on whom norms, reliability, and validity were established, the methods employed in computing indices of reliability and validity, and the specific criteria against which validity was checked. In the event that the necessary information is too lengthy to fit conveniently into the manual, references to the printed sources in which such information can be readily located should be given. The manual should, in other words, enable the test user to evaluate the test before choosing it for his specific purpose. It might be added that many test manuals still fall short of this goal. But some of the larger and more professionally oriented test publishers are giving increasing attention to the preparation of manuals that meet adequate scientific standards. An enlightened public of test users provides the firmest assurance that such standards will be maintained and improved in the future.

A succinct but comprehensive guide for the evaluation of psychological tests is to be found in Standards for Educational and Psychological Tests (1974), published by the American Psychological Association. These standards represent a summary of recommended practices in test construction, based on the current state of knowledge in the field. They are concerned with the information about validity, reliability, norms, and other test characteristics that ought to be reported in the manual. In their latest revision, the Standards also provide a guide for the proper use of tests and for the correct interpretation and application of test results. Relevant portions of the Standards will be cited in the following chapters, in connection with the appropriate topics.

CHAPTER 2

Nature and Use of Psychological Tests

THE HISTORICAL introduction in Chapter 1 has already suggested some of the many uses of psychological tests, as well as the wide
diversity of available tests. Although the general public may still
associate psychological tests most closely with "IQ tests" and with tests
designed to detect emotional disorders, these tests represent only a small
proportion of the available types of instruments. The major categories of
psychological tests will be discussed and illustrated in Parts 3, 4, and 5,
which cover tests of general intellectual level, traditionally called intelli-
gence tests; tests of separate abilities, including multiple aptitude bat-
teries, tests of special aptitudes, and achievement tests; and personality
tests, concerned with measures of emotional and motivational traits, in-
terpersonal behavior, interests, attitudes, and other noncognitive char-
acteristics.
In the face of such diversity in nature and purpose, what are the common differentiating characteristics of psychological tests? How do psychological tests differ from other methods of gathering information about individuals? The answer is to be found in certain fundamental features of both the construction and use of tests. It is with these features that the present chapter is concerned.

BEHAVIOR SAMPLE. A psychological test is essentially an objective and standardized measure of a sample of behavior. Psychological tests are like tests in any other science, insofar as observations are made on a small but carefully chosen sample of an individual's behavior. In this respect, the psychologist proceeds in much the same way as the chemist who tests a patient's blood or a community's water supply by analyzing one or more samples of it. If the psychologist wishes to test the extent of a child's vocabulary, a clerk's ability to perform arithmetic computations, or a pilot's eye-hand coordination, he examines their performance with a representative set of words, arithmetic problems, or motor tests. Whether or not the test adequately covers the behavior under consideration obviously depends on the number and nature of items in the sample. For example, an arithmetic test consisting of only five problems, or one including only multiplication items, would be a poor measure of the individual's computational skill. A vocabulary test composed entirely of baseball terms would hardly provide a dependable estimate of a child's total range of vocabulary.

The diagnostic or predictive value of a psychological test depends on the degree to which it serves as an indicator of a relatively broad and significant area of behavior. Measurement of the behavior sample directly covered by the test is rarely, if ever, the goal of psychological testing. The child's knowledge of a particular list of 50 words is not, in itself, of great interest. Nor is the job applicant's performance on a specific set of 20 arithmetic problems of much importance. If, however, it can be demonstrated that there is a close correspondence between the child's knowledge of the word list and his total mastery of vocabulary, or between the applicant's score on the arithmetic problems and his computational performance on the job, then the tests are serving their purpose.
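
The logic of generalizing from a small behavior sample to a broad domain can be made concrete with a brief illustration. The following minimal sketch, written in Python with entirely hypothetical figures (the 10,000-word domain, the 50-word probe, and the child's 4,000 known words are assumptions for illustration, not data from any actual test), shows how a score on a representative sample supports an estimate of total mastery:

    import random

    # Hypothetical domain: the full set of words whose mastery we wish to estimate.
    domain = [f"word_{i}" for i in range(10000)]

    # Unknown to the examiner: the words this particular child actually knows.
    known = set(random.sample(domain, 4000))

    # The test: a small but representative behavior sample drawn from the domain.
    probe = random.sample(domain, 50)
    raw_score = sum(word in known for word in probe)

    # Generalization from sample to domain: estimated total vocabulary.
    estimate = raw_score / len(probe) * len(domain)
    print(raw_score, round(estimate))   # e.g., 20 correct -> roughly 4,000 words

A probe drawn only from baseball terms would correspond to sampling from one unrepresentative corner of the domain, and the resulting estimate would be biased accordingly.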
It should be noted in this connection that the test items need not resemble closely the behavior the test is to predict. It is only necessary that a close correspondence be demonstrated between the two. The degree of similarity between the test sample and the predicted behavior may vary widely. At one extreme, the test may coincide completely with a part of the behavior to be predicted. An example might be a foreign vocabulary test in which the students are examined on 20 of the 50 new words they have studied; another example is provided by the road test taken prior to obtaining a driver's license. A lesser degree of similarity is illustrated by many vocational aptitude tests administered prior to job training, in which there is only a moderate resemblance between the tasks performed on the job and those incorporated in the test. At the other extreme one finds projective personality tests such as the Rorschach inkblot test, in which an attempt is made to predict from the subject's associations to inkblots how he will react to other people, to emotionally toned stimuli, and to other complex, everyday-life situations. Despite their superficial differences, all these tests consist of samples of the individual's behavior. And each must prove its worth by an empirically demonstrated correspondence between the subject's performance on the test and in other situations.

Whether the term "diagnosis" or the term "prediction" is employed in this connection also represents a minor distinction. Prediction commonly connotes a temporal estimate, the individual's future performance on a job, for example, being forecast from his present test performance. In a broader sense, however, even the diagnosis of present condition, such as mental retardation or emotional disorder, implies a prediction of what the individual will do in situations other than the present test. It is logically simpler to consider all tests as behavior samples from which predictions regarding other behavior can be made. Different types of tests can then be characterized as variants of this basic pattern.

Another point that should be considered at the outset pertains to the concept of capacity. It is entirely possible, for example, to devise a test for predicting how well an individual can learn French before he has even begun the study of French. Such a test would involve a sample of the types of behavior required to learn the new language, but would in itself presuppose no knowledge of French. It could then be said that this test measures the individual's "capacity" or "potentiality" for learning French. Such terms should, however, be used with caution in reference to psychological tests. Only in the sense that a present behavior sample can be used as an indicator of other, future behavior can we speak of a test measuring "capacity." No psychological test can do more than measure behavior. Whether such behavior can serve as an effective index of other behavior can be determined only by empirical try-out.

STANDARDIZATION. It will be recalled that in the initial definition a psychological test was described as a standardized measure. Standardization implies uniformity of procedure in administering and scoring the test. If the scores obtained by different individuals are to be comparable, testing conditions must obviously be the same for all. Such a requirement is only a special application of the need for controlled conditions in all scientific observations. In a test situation, the single independent variable is usually the individual being tested.

In order to secure uniformity of testing conditions, the test constructor provides detailed directions for administering each newly developed test. The formulation of such directions is a major part of the standardization of a new test. Such standardization extends to the exact materials employed, time limits, oral instructions to subjects, preliminary demonstrations, ways of handling queries from subjects, and every other detail of the testing situation. Many other, more subtle factors may influence the subject's performance on certain tests. Thus, in giving instructions or presenting problems orally, consideration must be given to the rate of speaking, tone of voice, inflection, pauses, and facial expression. In a test involving the detection of absurdities, for example, the correct answer may be given away by smiling or pausing when the crucial word is read. Standardized testing procedure, from the examiner's point of view, will be discussed further in a later section of this chapter dealing with problems of test administration.

Another important step in the standardization of a test is the establishment of norms. Psychological tests have no predetermined standards of passing or failing; an individual's score is evaluated by comparing it with the scores obtained by others. As its name implies, a norm is the normal or average performance. Thus, if normal 8-year-old children complete 12 out of 50 problems correctly on a particular arithmetic reasoning test, then the 8-year-old norm on this test corresponds to a score of 12. The latter is known as the raw score on the test. It may be expressed as number of correct items, time required to complete a task, number of errors, or some other objective measure appropriate to the content of the test. Such a raw score is meaningless until evaluated in terms of a suitable set of norms.
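
As a worked illustration of norm-referenced evaluation, a raw score can be re-expressed as a deviation from the age norm. The following sketch uses invented normative figures (a mean of 12 and a standard deviation of 3 are assumed for the example, not taken from any published test) and assumes a roughly normal score distribution:

    from statistics import NormalDist

    # Invented norms for an arithmetic reasoning test at age 8:
    # mean raw score 12, standard deviation 3 in the standardization sample.
    NORM_MEAN, NORM_SD = 12.0, 3.0

    def standard_score(raw):
        """Express a raw score as a z score relative to the age norm."""
        return (raw - NORM_MEAN) / NORM_SD

    z = standard_score(18)
    percentile = NormalDist().cdf(z) * 100   # assumes a roughly normal distribution
    print(z, round(percentile))              # 2.0 -> about the 98th percentile

A raw score of 12 would yield a z score of zero: exactly average for the age group, which is all the norm asserts.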
In the process of standardizing a test, it is administered to a large, representative sample of the type of subjects for whom it is designed. This group, known as the standardization sample, serves to establish the norms. Such norms indicate not only the average performance but also the relative frequency of varying degrees of deviation above and below the average. It is thus possible to evaluate different degrees of superiority and inferiority. The specific ways in which such norms may be expressed will be considered in Chapter 4. All permit the designation of the individual's position with reference to the normative or standardization sample.

It might also be noted that norms are established for personality tests in essentially the same way as for aptitude tests. The norm on a personality test is not necessarily the most desirable or "ideal" performance, any more than a perfect or errorless score is the norm on an aptitude test. On both types of tests, the norm corresponds to the performance of typical or average individuals. On dominance-submission tests, for example, the norm falls at an intermediate point representing the degree of dominance or submission manifested by the average individual. Similarly, in an emotional adjustment inventory, the norm does not ordinarily correspond to a complete absence of unfavorable or maladaptive responses, since a few such responses occur in the majority of "normal" individuals in the standardization sample. It is thus apparent that psychological tests, of whatever type, are based on empirically established norms.

OBJECTIVE MEASUREMENT OF DIFFICULTY. Reference to the definition of a psychological test with which this discussion opened will show that such a test was characterized as an objective as well as a standardized measure. In what specific ways are such tests objective? Some aspects of the objectivity of psychological tests have already been touched on in the discussion of standardization. Thus, the administration, scoring, and interpretation of scores are objective insofar as they are independent of the subjective judgment of the individual examiner. Any one individual should theoretically obtain the identical score on a test regardless of who happens to be his examiner. This is not entirely so, of course, since perfect standardization and objectivity have not been attained in practice. But at least such objectivity is the goal of test construction and has been achieved to a reasonably high degree in most tests.

There are other major ways in which psychological tests can be properly described as objective. The determination of the difficulty level of an item or of a whole test is based on objective, empirical procedures. When Binet and Simon prepared their original, 1905 scale for the measurement of intelligence, they arranged the 30 items of the scale in order of increasing difficulty. Such difficulty, it will be recalled, was determined by trying out the items on 50 normal and a few mentally retarded children. The items correctly solved by the largest number of children were, ipso facto, taken to be the easiest; those passed by relatively few children were regarded as more difficult items. By this procedure, an empirical order of difficulty was established. This early example typifies the objective measurement of difficulty level, which is now common practice in psychological test construction.

Not only the arrangement but also the selection of items for inclusion in a test can be determined by the proportion of subjects in the trial samples who pass each item. Thus, if there is a bunching of items at the easy or difficult end of the scale, some items can be discarded. Similarly, if items are sparse in certain portions of the difficulty range, new items can be added to fill the gaps. More technical aspects of item analysis will be considered in Chapter 8.
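
The empirical procedure just described, computing the proportion of a tryout sample passing each item, amounts to a few lines of arithmetic. A minimal sketch with hypothetical tryout data (five children and four items, invented for illustration):

    # Hypothetical tryout records: one row per child, 1 = item passed.
    responses = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 1, 0, 1],
        [1, 0, 0, 1],
    ]

    n = len(responses)
    # Difficulty (p) value of each item: proportion of children passing it.
    p_values = [sum(row[i] for row in responses) / n for i in range(4)]

    # Items passed by the most children are, ipso facto, the easiest.
    order = sorted(range(4), key=lambda i: p_values[i], reverse=True)
    print(p_values)   # [0.8, 0.6, 0.2, 1.0]
    print(order)      # easiest to hardest: [3, 0, 1, 2]

An item passed by everyone or by no one, like the fourth and third items here, contributes little information and would be a candidate for removal.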
RELIABILITY. How good is this test? Does it really work? These questions could, and occasionally do, result in long hours of futile discussion. Subjective opinions, hunches, and personal biases may lead, on the one hand, to extravagant claims regarding what a particular test can accomplish and, on the other hand, to stubborn rejection. The only way questions such as these can be conclusively answered is by empirical trial. The objective evaluation of psychological tests involves primarily the determination of the reliability and the validity of the test in specified situations.

As used in psychometrics, the term reliability always means consistency. Test reliability is the consistency of scores obtained by the same persons when retested with the identical test or with an equivalent form of the test. If a child receives an IQ of 110 on Monday and an IQ of 80
when retested on Friday, it is obvious that little or no confidence can be put in either score. Similarly, if in one set of 50 words an individual identifies 40 correctly, whereas in another, supposedly equivalent set he gets a score of only 20 right, then neither score can be taken as a dependable index of his verbal comprehension. To be sure, in both illustrations it is possible that only one of the two scores is in error, but this could be demonstrated only by further retests. From the given data, we can conclude only that both scores cannot be right. Whether one or neither is an adequate estimate of the individual's ability in vocabulary cannot be established without additional information.

Before a psychological test is released for general use, a thorough, objective check of its reliability should be carried out. The different types of test reliability, as well as methods of measuring each, will be considered in Chapter 5. Reliability can be checked with reference to temporal fluctuations, the particular selection of items or behavior sample constituting the test, the role of different examiners or scorers, and other aspects of the testing situation. It is essential to specify the type of reliability and the method employed to determine it, because the same test may vary in these different aspects. The number and nature of individuals on whom reliability was checked should likewise be reported. With such information, the test user can predict whether the test will be about equally reliable for the group with which he expects to use it, or whether it is likely to be more reliable or less reliable.
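
Consistency of this kind is conventionally indexed by correlating the two sets of scores. A minimal sketch with hypothetical scores for ten persons tested on two occasions (statistics.correlation computes a Pearson r and requires Python 3.10 or later):

    from statistics import correlation   # Pearson r

    # Hypothetical scores for the same ten persons on two occasions.
    first_testing  = [44, 38, 51, 29, 47, 40, 35, 49, 31, 42]
    second_testing = [46, 36, 50, 31, 45, 42, 33, 48, 30, 44]

    # Test-retest reliability: the correlation between the two administrations.
    r_tt = correlation(first_testing, second_testing)
    print(round(r_tt, 2))   # close to 1.0 -> scores are stable across occasions

Had the two columns borne little relation to each other, as in the IQ 110 versus IQ 80 illustration above, the coefficient would have fallen well below the levels acceptable for published tests.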
VALIDITY. Undoubtedly the most important question to be asked about any psychological test concerns its validity, i.e., the degree to which the test actually measures what it purports to measure. Validity provides a direct check on how well the test fulfills its function. The determination of validity usually requires independent, external criteria of whatever the test is designed to measure. For example, if a medical aptitude test is to be used in selecting promising applicants for medical school, ultimate success in medical school would be a criterion. In the process of validating such a test, it would be administered to a large group of students at the time of their admission to medical school. Some measure of performance in medical school would eventually be obtained for each student on the basis of grades, ratings by instructors, success or failure in completing training, and the like. Such a composite measure constitutes the criterion with which each student's initial test score is to be correlated. A high correlation, or validity coefficient, would signify that those individuals who scored high on the test had been relatively successful in medical school, whereas those scoring low on the test had done poorly in medical school. A low correlation would indicate little correspondence between test score and criterion measure and hence poor validity for the test. The validity coefficient enables us to determine how closely the criterion performance could have been predicted from the test scores.
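
In computational terms, the validity coefficient is simply the correlation between the test scores and the later criterion measures. A sketch with invented admission-test and criterion figures (illustrative only; statistics.correlation requires Python 3.10 or later):

    from statistics import correlation   # Pearson r

    # Invented data: admission test scores and a later composite criterion
    # (e.g., grades and instructor ratings in medical school).
    test_scores = [62, 75, 58, 90, 70, 83, 55, 78]
    criterion   = [2.6, 3.1, 2.4, 3.8, 2.9, 3.5, 2.2, 3.3]

    validity = correlation(test_scores, criterion)
    print(round(validity, 2))   # a high value -> high scorers tended to succeed

With real data the coefficient would of course be far from perfect, and its size would indicate how much confidence the predictions deserve.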
In a similar manner, tests designed for other purposes can be validated against appropriate criteria. A vocational aptitude test, for example, can be validated against on-the-job success of a trial group of new employees. A pilot aptitude battery can be validated against achievement in flight training. Tests designed for broader and more varied uses are validated against a number of criteria and their validity can be established only by the gradual accumulation of data from many different kinds of investigations.

The reader may have noticed an apparent paradox in the concept of test validity. If it is necessary to follow up the subjects or in other ways to obtain independent measures of what the test is trying to predict, why not dispense with the test? The answer to this riddle is to be found in the distinction between the validation group on the one hand and the groups on which the test will eventually be employed for operational purposes on the other. Before the test is ready for use, its validity must be established on a representative sample of subjects. The scores of these persons are not themselves employed for operational purposes but serve only in the process of testing the test. If the test proves valid by this method, it can then be used on other samples in the absence of criterion measures.

It might still be argued that we would need only to wait for the criterion measure to mature, to become available, on any group in order to obtain the information that the test is trying to predict. But such a procedure would be so wasteful of time and energy as to be prohibitive in most instances. Thus, we could determine which applicants will succeed on a job or which students will satisfactorily complete college by admitting all who apply and waiting for subsequent developments! It is the very wastefulness of this procedure, and its deleterious emotional impact on individuals, that tests are designed to minimize. By means of tests, the person's present level of prerequisite skills, knowledge, and other relevant characteristics can be assessed with a determinable margin of error. The more valid and reliable the test, the smaller will be this margin of error.

The special problems encountered in determining the validity of different types of tests, as well as the specific criteria and statistical procedures employed, will be discussed in Chapters 6 and 7. One further point, however, should be considered at this time. Validity tells us more than the degree to which the test is fulfilling its function. It actually tells us what the test is measuring. By studying the validation data, we can objectively determine what the test is measuring. It would thus be more accurate to define validity as the extent to which we know what the test measures. The interpretation of test scores would undoubtedly be clearer and less ambiguous if tests were regularly named in terms of the criterion
through which they had been validated. A tendency in this direction may be recognized in such test labels as "scholastic aptitude test" and "personnel classification test" in place of the vague title "intelligence test."

REASONS FOR CONTROLLING THE USE OF PSYCHOLOGICAL TESTS

"May I have a Stanford-Binet blank? My nephew has to take it next week for admission to School X and I'd like to give him some practice so he can pass."

"To improve the reading program in our school, we need a culture-free IQ test that measures each child's innate potential."

"Last night I answered the questions in an intelligence test published in a magazine and I got an IQ of 80. I think psychological tests are silly."

"My roommate is studying psych. She gave me a personality test and I came out neurotic. I've been too upset to go to class ever since."

"Last year you gave a new personality test to our employees for research purposes. We would now like to have the scores for their personnel folders."

The above remarks are not imaginary. Each is based on a real incident, and the list could easily be extended by any psychologist. Such remarks illustrate potential misuses or misinterpretations of psychological tests in such ways as to render the tests worthless or to hurt the individual. Like any scientific instrument or precision tool, psychological tests must be properly used to be effective. In the hands of either the unscrupulous or the well-meaning but uninformed user, such tests can cause serious damage.

There are two principal reasons for controlling the use of psychological tests: (a) to prevent general familiarity with test content, which would invalidate the test, and (b) to ensure that the test is used by a qualified examiner. Obviously, if an individual were to memorize the correct responses on a test of color blindness, such a test would no longer be a measure of color vision for him. Under these conditions, the test would be completely invalidated. Test content clearly has to be restricted in order to forestall deliberate efforts to fake scores.

In other cases, however, the effect of familiarity may be less obvious, or the test may be invalidated in good faith by misinformed persons. A schoolteacher, for example, may give her class special practice in problems closely resembling those on an intelligence test, "so that the pupils will be well prepared to take the test." Such an attitude is simply a carryover from the usual procedure of preparing for a school examination. When applied to an intelligence test, however, it is likely that such specific training or coaching will raise the scores on the test without appreciably affecting the broader area of behavior the test tries to sample. Under such conditions, the validity of the test as a predictive instrument is reduced.

The need for a qualified examiner is evident in each of the three major aspects of the testing situation: selection of the test, administration and scoring, and interpretation of scores. Tests cannot be chosen like lawn mowers, from a mail-order catalogue. They cannot be evaluated by name, author, or other easy marks of identification. To be sure, it requires no psychological training to consider such factors as cost, bulkiness and ease of transporting test materials, testing time required, and ease and rapidity of scoring. Information on these practical points can usually be obtained from a test catalogue and should be taken into account in planning a testing program. For the test to serve its function, however, an evaluation of its technical merits in terms of such characteristics as validity, reliability, difficulty level, and norms is essential. Only in such a way can the test user determine the appropriateness of any test for his particular purpose and its suitability for the type of persons with whom he plans to use it.

The introductory discussion of test standardization earlier in this chapter has already suggested the importance of a trained examiner. An adequate realization of the need to follow instructions precisely, as well as a thorough familiarity with the standard instructions, is required if the test scores obtained by different examiners are to be comparable or if any one individual's score is to be evaluated in terms of the published norms. Careful control of testing conditions is also essential. Similarly, incorrect or inaccurate scoring may render the test score worthless. In the absence of proper checking procedures, scoring errors are far more likely to occur than is generally realized.

The proper interpretation of test scores requires a thorough understanding of the test, the individual, and the testing conditions. What is being measured can be objectively determined only by reference to the specific procedures in terms of which the particular test was validated. Other information, pertaining to reliability, nature of the group on which norms were established, and the like, is likewise relevant. Some background data regarding the individual being tested are essential in interpreting any test score. The same score may be obtained by different persons for very different reasons. The conclusions to be drawn from such scores would therefore be quite dissimilar. Finally, some consideration must also be given to special factors that may have influenced a particular score, such as unusual testing conditions, temporary emotional or physical state of the subject, and extent of the subject's previous experience with tests.
The basic rationale of testing involves generalization from the behavior sample observed in the testing situation to behavior manifested in other, nontest situations. A test score should help us to predict how the client will feel and act outside the clinic, how the student will achieve in college courses, and how the applicant will perform on the job. Any influences that are specific to the test situation constitute error variance and reduce test validity. It is therefore important to identify any test-related influences that may limit or impair the generalizability of test results.

A whole volume could easily be devoted to a discussion of desirable procedures of test administration. But such a survey falls outside the scope of the present book. Moreover, it is more practicable to acquire such techniques within specific settings, because no one person would normally be concerned with all forms of testing, from the examination of infants to the clinical testing of psychotic patients or the administration of a mass testing program for military personnel. The present discussion will therefore deal principally with the common rationale of test administration rather than with specific questions of implementation. For detailed suggestions regarding testing procedure, see Palmer (1970), Sattler (1974), and Terman and Merrill (1960) for individual testing, and Clemans (1971) for group testing.

ADVANCE PREPARATION OF EXAMINERS. The most important requirement for good testing procedure is advance preparation. In testing there can be no emergencies. Special efforts must therefore be made to foresee and forestall emergencies. Only in this way can uniformity of procedure be assured.

Advance preparation for the testing session takes many forms. Memorizing the exact verbal instructions is essential in most individual testing. Even in a group test in which the instructions are read to the subjects, some previous familiarity with the statements to be read prevents misreading and hesitation and permits a more natural, informal manner during test administration. The preparation of test materials is another important preliminary step. In individual testing and especially in the administration of performance tests, such preparation involves the actual layout of the necessary materials to facilitate subsequent use with a minimum of search or fumbling. Materials should generally be placed on a table near the testing table so that they are within easy reach of the examiner but do not distract the subject. When apparatus is employed, frequent periodic checking and calibration may be necessary. In group testing, all test blanks, answer sheets, special pencils, or other materials needed should be carefully counted, checked, and arranged in advance of the testing day.

Thorough familiarity with the specific testing procedure is another important prerequisite in both individual and group testing. For individual testing, supervised training in the administration of the particular test is usually essential. Depending upon the nature of the test and the type of subjects to be examined, such training may require from a few demonstration and practice sessions to over a year of instruction. For group testing, and especially in large-scale projects, such preparation may include advance briefing of examiners and proctors, so that each is fully informed about the functions he is to perform. In general, the examiner reads the instructions, takes care of timing, and is in charge of the group in any one testing room. The proctors hand out and collect test materials, make certain that subjects are following instructions, answer individual questions of subjects within the limitations specified in the manual, and prevent cheating.

TESTING CONDITIONS. Standardized procedure applies not only to verbal instructions, timing, materials, and other aspects of the tests themselves but also to the testing environment. Some attention should be given to the selection of a suitable testing room. This room should be free from undue noise and distraction and should provide adequate lighting, ventilation, and seating and working space for the subjects. Special precautions should also be taken to prevent interruptions during the test. Posting a sign on the door to indicate that testing is in progress is effective, provided all personnel have learned that such a sign means no admittance under any circumstances. In the testing of large groups, locking the doors or posting an assistant outside each door may be necessary to prevent the entrance of late-comers.

It is important to realize the extent to which testing conditions may influence scores. Even apparently minor aspects of the testing situation may appreciably alter performance. Such a factor as the use of desks or of chairs with desk arms, for example, proved to be significant in a group testing project with high school students, the groups using desks tending to obtain higher scores (Kelley, 1943; Traxler & Hilkert, 1942). There is also evidence to show that the type of answer sheet employed may affect test scores (Bell, Hoff, & Hoyt, 1963). Since the establishment of independent test-scoring and data-processing agencies that provide their own machine-scorable answer sheets, examiners sometimes administer group tests with answer sheets other than those used in the standardization sample. In the absence of empirical verification, the equivalence of these answer sheets cannot be assumed. The Differential Aptitude Tests, for example, may be administered with any of five different answer
sheets. On the Clerical Speed and Accuracy Test of this battery, separate norms are provided for three of the five answer sheets, because they were found to yield substantially different scores than those obtained with the answer sheets used by the standardization sample. In testing children below the fifth grade, the use of any separate answer sheet may significantly lower test scores (Metropolitan Achievement Test Special Report, 1975). At these grade levels, having the child mark the answers in the test booklet itself is generally preferable.

Many other, more subtle testing conditions have been shown to affect performance on ability as well as personality tests. Whether the examiner is a stranger or someone familiar to the subjects may make a significant difference in test scores (Sacks, 1952; Tsudzuki, Hata, & Kuze, 1957). In another study, the general manner and behavior of the examiner, as illustrated by smiling, nodding, and making such comments as "good" or "fine," were shown to have a decided effect on test results (Wickes, 1956). In a projective test requiring the subject to write stories to fit given pictures, the presence of the examiner in the room tended to inhibit the inclusion of strongly emotional content in the stories (Bernstein, 1956). In the administration of a typing test, job applicants typed at a significantly faster rate when tested alone than when tested in groups of two or more (Kirchner, 1966).

Examples could readily be multiplied. The implications are threefold. First, follow standardized procedures to the minutest detail. It is the responsibility of the test author and publisher to describe such procedures fully and clearly in the test manual. Second, record any unusual testing conditions, however minor. Third, take testing conditions into account when interpreting test results. In the intensive assessment of a person through individual testing, an experienced examiner may occasionally depart from the standardized test procedure in order to elicit additional information for special reasons. When he does so, he can no longer interpret the subject's responses in terms of the test norms. Under these circumstances, the test stimuli are used only for qualitative exploration; and the responses should be treated in the same way as any other informal behavioral observations or interview data.

In psychometrics, the term "rapport" refers to the examiner's efforts to arouse the subject's interest in the test, elicit his cooperation, and ensure that he follows the standard test instructions. In ability tests, the instructions call for careful concentration on the given tasks and for putting forth one's best efforts to perform well; in personality inventories, they call for frank and honest responses to questions about one's usual behavior; in certain projective tests, they call for full reporting of associations evoked by the stimuli, without any censoring or editing of content. Still other kinds of tests may require other approaches. But in all instances, the examiner endeavors to motivate the subject to follow the instructions as fully and conscientiously as he can.

The training of examiners covers techniques for the establishment of rapport as well as those more directly related to test administration. In establishing rapport, as in other testing procedures, uniformity of conditions is essential for comparability of results. If a child is given a coveted prize whenever he solves a test problem correctly, his performance cannot be directly compared with the norms or with that of other children who are motivated only with the standard verbal encouragement or praise. Any deviation from standard motivating conditions for a particular test should be noted and taken into account in interpreting performance.

Although rapport can be more fully established in individual testing, steps can also be taken in group testing to motivate the subjects and relieve their anxiety. Specific techniques for establishing rapport vary with the nature of the test and with the age and other characteristics of the subjects. In testing preschool children, special factors to be considered include shyness with strangers, distractibility, and negativism. A friendly, cheerful, and relaxed manner on the part of the examiner helps to reassure the child. The shy, timid child needs more preliminary time to become familiar with his surroundings. For this reason it is better for the examiner not to be too demonstrative at the outset, but rather to wait until the child is ready to make the first contact. Test periods should be brief, and the tasks should be varied and intrinsically interesting to the child. The testing should be presented to the child as a game and his curiosity aroused before each new task is introduced. A certain flexibility of procedure is necessary at this age level because of possible refusals, loss of interest, and other manifestations of negativism.

Children in the first two or three grades of elementary school present many of the same testing problems as the preschool child. The game approach is still the most effective way of arousing their interest in the test. The older schoolchild can usually be motivated through an appeal to his competitive spirit and his desire to do well on tests. When testing children from educationally disadvantaged backgrounds or from different cultures, however, the examiner cannot assume they will be motivated to excel on academic tasks to the same extent as children in the standardization sample. This problem and others pertaining to the testing of persons with dissimilar experiential backgrounds will be considered further in Chapters 3, 7, and 12.

Special motivational problems may be encountered in testing emotionally disturbed persons, prisoners, or juvenile delinquents. Especially when examined in an institutional setting, such persons are likely to
manifest a number of unfavorable attitudes, such as suspicion, insecurity, fear, or cynical indifference. Abnormal conditions in their past experiences are also likely to influence their test performance adversely. As a result of early failures and frustrations in school, for example, they may have developed feelings of hostility and inferiority toward academic tasks, which the tests resemble. The experienced examiner makes special efforts to establish rapport under these conditions. In any event, he must be sensitive to these special difficulties and take them into account in interpreting and explaining test performance.

In testing any school-age child or adult, one should bear in mind that every test presents an implied threat to the individual's prestige. Some reassurance should therefore be given at the outset. It is helpful to explain, for example, that no one is expected to finish or to get all the items correct. The individual might otherwise experience a mounting sense of failure as he advances to the more difficult items or finds that he is unable to finish any subtest within the time allowed.

It is also desirable to eliminate the element of surprise from the test situation as far as possible, because the unexpected and unknown are likely to produce anxiety. Many group tests provide a preliminary explanatory statement that is read to the group by the examiner. An even better procedure is to announce the tests a few days in advance and to give each subject a printed booklet that explains the purpose and nature of the tests, offers general suggestions on how to take tests, and contains a few sample items. Such explanatory booklets are regularly available to participants in large-scale testing programs such as those conducted by the College Entrance Examination Board (1974a, 1974b). The United States Employment Service has likewise developed a booklet on how to take tests, as well as a more extensive pretesting orientation technique for use with culturally disadvantaged applicants unfamiliar with tests. More general orientation booklets are also available, such as Meeting the Test (Anderson, Katz, & Shimberg, 1965). A tape recording and two booklets are combined in Test Orientation Procedure (TOP), designed specifically for job applicants with little prior testing experience (Bennett & Doppelt, 1967). The first booklet, used together with the tape, provides general information on how to take tests; the second contains practice tests. In the absence of a tape recorder, the examiner may read the instructions from a printed script.

Adult testing presents some additional problems. Unlike the schoolchild, the adult is not so likely to work hard at a task merely because it is assigned to him. It therefore becomes more important to "sell" the purpose of the tests to the adult, although high school and college students also respond to such an appeal. Cooperation of the examinee can usually be secured by convincing him that it is in his own interests to obtain a valid score, i.e., a score correctly indicating what he can do rather than overestimating or underestimating his abilities. Most persons will understand that an incorrect decision, which might result from invalid test scores, would mean subsequent failure, loss of time, and frustration for them. This approach can serve not only to motivate the individual to try his best on ability tests but also to reduce faking and encourage frank reporting on personality inventories, because the examinee realizes that he himself would otherwise be the loser. It is certainly not in the best interests of the individual to be admitted to a course of study for which he is not qualified or assigned to a job he cannot perform or that he would find uncongenial.

Many of the practices designed to enhance rapport serve also to reduce test anxiety. Procedures tending to dispel surprise and strangeness from the testing situation and to reassure and encourage the subject should certainly help to lower anxiety. The examiner's own manner and a well-organized, smoothly running testing operation will contribute toward the same goal. Individual differences in test anxiety have been studied with both schoolchildren and college students (Gaudry & Spielberger, 1974; Spielberger, 1972). Much of this research was initiated by Sarason and his associates at Yale (Sarason, Davidson, Lighthall, Waite, & Ruebush, 1960). The first step was to construct a questionnaire to assess the individual's test-taking attitudes. The children's form, for example, contains items such as the following:

Do you worry a lot before taking a test?

When the teacher says she is going to find out how much you have learned, does your heart begin to beat faster?

While you are taking a test, do you usually think you are not doing well?

Of primary interest is the finding that both school achievement and intelligence test scores yielded significant negative correlations with test anxiety. Similar correlations have been found among college students (I. G. Sarason, 1961). Longitudinal studies likewise revealed an inverse relation between changes in anxiety level and changes in intelligence or achievement test performance (Hill & Sarason, 1966; Sarason, Hill, & Zimbardo, 1964).

Such findings, of course, do not indicate the direction of causal relationships. It is possible that children develop test anxiety because they
perform poorly on tests and have thus experienced failure and frustration in previous test situations. In support of this interpretation is the finding that within subgroups of high scorers on intelligence tests, the negative correlation between anxiety level and test performance disappears (Denny, 1966; Feldhusen & Klausmeier, 1962). On the other hand, there is evidence suggesting that at least some of the relationship results from the deleterious effects of anxiety on test performance. In one study (Waite, Sarason, Lighthall, & Davidson, 1958), high-anxious and low-anxious children equated in intelligence test scores were given repeated trials in a learning task. Although initially equal in the learning test, the low-anxious group improved significantly more than the high-anxious.

Several investigators have compared test performance under conditions designed to evoke "anxious" and "relaxed" states. Mandler and Sarason (1952), for example, found that ego-involving instructions, such as telling subjects that everyone is expected to finish in the time allotted, had a beneficial effect on the performance of low-anxious subjects, but a deleterious effect on that of high-anxious subjects. Other studies have likewise found an interaction between testing conditions and such individual characteristics as anxiety level and achievement motivation (Lawrence, 1962; Paul & Eriksen, 1964). It thus appears likely that the relation between anxiety and test performance is nonlinear, a slight amount of anxiety being beneficial, while a large amount is detrimental. Individuals who are customarily low-anxious benefit from test conditions that arouse some anxiety, while those who are customarily high-anxious perform better under more relaxed conditions.

It is undoubtedly true that a chronically high anxiety level will exert a detrimental effect on school learning and intellectual development. Such an effect, however, should be distinguished from the test-limited effects with which this discussion is concerned. To what extent does test anxiety make the individual's test performance unrepresentative of his customary performance level in nontest situations? Because of the competitive pressure experienced by college-bound high school seniors in America today, it has been argued that performance on college admission tests may be unduly affected by test anxiety. In a thorough and controlled investigation of this question, French (1962) compared the performance of high school students on a test given as part of the regular administration of the SAT with performance on a parallel form of the test administered at a different time under "relaxed" conditions. The instructions on the latter occasion specified that the test was given for research purposes only and scores would not be sent to any college. The results showed that per-

Comprehensive surveys of the effects of examiner and situational variables on test scores have been prepared by S. B. Sarason (1954), Masling (1960), Moriarty (1961, 1966), Sattler and Theye (1967), Palmer (1970), and Sattler (1970, 1974). Although some effects have been demonstrated with objective group tests, most of the data have been obtained with either projective techniques or individual intelligence tests. These extraneous factors are more likely to operate with unstructured and ambiguous stimuli, as well as with difficult and novel tasks, than with clearly defined and well-learned functions. In general, children are more susceptible to examiner and situational influences than are adults; in the examination of preschool children, the role of the examiner is especially crucial. Emotionally disturbed and insecure persons of any age are also more likely to be affected by such conditions than are well-adjusted persons.

There is considerable evidence that test results may vary systematically as a function of the examiner (E. Cohen, 1965; Masling, 1960). These differences may be related to personal characteristics of the examiner, such as his age, sex, race, professional or socioeconomic status, training and experience, personality characteristics, and appearance. Several studies of these examiner variables, however, have yielded misleading or inconclusive results because the experimental designs failed to control or isolate the influence of different examiner or subject characteristics. Hence the effects of two or more variables may be confounded.

The examiner's behavior before and during test administration has also been shown to affect test results. For example, controlled investigations have yielded significant differences in intelligence test performance as a result of a "warm" versus a "cold" interpersonal relation between examiner and examinees, or a rigid and aloof versus a natural manner on the part of the examiner (Exner, 1966; Masling, 1959). Moreover, there may be significant interactions between examiner and examinee characteristics, in the sense that the same examiner characteristic or testing manner may have a different effect on different examinees as a function of the examinee's own personality characteristics. Similar interactions may occur with task variables, such as the nature of the test, the purpose of the testing, and the instructions given to the subjects. Dyer (1973) adds even more variables to this list, calling attention to the possible influence of the test givers' and the test takers' diverse perceptions of the functions
and goals of testing.' '
formance was no poorer during the standard administration than during St'll '
the relaxed administration. Moreover, the concurrent validitv of the test • '. I. an,other way in which an examin8r may inadvertently affect the
scores against high school course grades did not differ signifi~antly under ~x~~m~e s responses is through ~is own 'cexpectations, This is simply a
the two conditions. P clal mstance of the self-fulfilhng prophecy (Rosenthal, 1966; Rosen-
An experiment conducted with the Rorschach will illustrate this effect (Masling, 1965). The examiners were 14 graduate student volunteers, 7 of whom were told, among other things, that experienced examiners elicit more human than animal responses from the subjects, while the other 7 were told that experienced examiners elicit more animal than human responses. Under these conditions, the two groups of examiners obtained significantly different ratios of animal to human responses from their subjects. These differences occurred despite the fact that neither examiners nor subjects reported awareness of any influence attempt. Moreover, tape recordings of all testing sessions revealed no evidence of verbal influence on the part of any examiner. The examiners' expectations apparently operated through subtle postural and facial cues to which the subjects responded.

Apart from the examiner, other aspects of the testing situation may significantly affect test performance. Military recruits, for example, are often examined shortly after induction, during a period of intense readjustment to an unfamiliar and stressful situation. In one investigation designed to test the effect of acclimatization to such a situation on test performance, 2,724 recruits were given the Navy Classification Battery during their ninth day at the Naval Training Center (Gordon & Alf, 1960). When their scores were compared with those obtained by 2,180 recruits tested at the conventional time, during their third day, the 9-day group scored significantly higher on all subtests of the battery.

The examinees' activities immediately preceding the test may also affect their performance, especially when such activities produce emotional disturbance, fatigue, or other handicapping conditions. In an investigation with third- and fourth-grade schoolchildren, there was some evidence to suggest that IQ on the Draw-a-Man Test was influenced by the children's preceding classroom activity (McCarthy, 1944). On one occasion, the class had been engaged in writing a composition on "The Best Thing That Ever Happened to Me"; on the second occasion, they had again been writing, but this time on "The Worst Thing That Ever Happened to Me." The IQ's on the second test, following what may have been an emotionally depressing experience, averaged 4 or 5 points lower than on the first test. These findings were corroborated in a later investigation specifically designed to determine the effect of immediately preceding experience on the Draw-a-Man Test (Reichenberg-Hackett, 1953). In this study, children who had had a gratifying experience involving the successful solution of an interesting puzzle, followed by a reward of toys and candy, showed more improvement in their test scores than those who had undergone neutral or less gratifying experiences. Similar results were obtained by W. E. Davis (1969a, 1969b) with college students. Performance on an arithmetic reasoning test was significantly poorer when preceded by a failure experience on a verbal comprehension test than it was in a control group given no preceding test and in one that had taken a standard verbal comprehension test under ordinary conditions.

Several studies have been concerned with the effects of feedback regarding test scores on the individual's subsequent test performance. In a particularly well-designed investigation with seventh-grade students, Bridgeman (1974) found that "success" feedback was followed by significantly higher performance on a similar test than was "failure" feedback in subjects who had actually performed equally well to begin with. This type of motivational feedback may operate largely through the goals the subjects set for themselves in subsequent performance and may thus represent another example of the self-fulfilling prophecy. Such general motivational feedback, however, should not be confused with corrective feedback, whereby the individual is informed about the specific items he missed and given remedial instruction; under these conditions, feedback is much more likely to improve the performance of initially low-scoring persons.

The examples cited in this section illustrate the wide diversity of test-related factors that may affect test scores. In the majority of well-administered testing programs, the influence of these factors is negligible for practical purposes. Nevertheless, the skilled examiner is constantly on guard to detect the possible operation of such factors and to minimize their influence. When circumstances do not permit the control of these conditions, the conclusions drawn from test performance should be qualified.

EFFECTS OF TRAINING ON TEST PERFORMANCE

In evaluating the effect of coaching or practice on test scores, a fundamental question is whether the improvement is limited to the specific items included in the test or whether it extends to the broader area of behavior that the test is designed to predict. The answer to this question represents the difference between coaching and education. Obviously any educational experience the individual undergoes, either formal or informal, in or out of school, should be reflected in his performance on tests sampling the relevant aspects of behavior. Such broad influences will in no way invalidate the test, since the test score presents an accurate picture of the individual's standing in the abilities under consideration. The difference is, of course, one of degree. Influences cannot be classified as either narrow or broad, but obviously vary widely in scope, from those affecting only a single administration of a single test, through those affecting performance on all items of a certain type, to those influencing the individual's performance in the large majority of his activities. From the standpoint of effective testing, however, a workable distinction can be
made. Thus, it can be stated that a test score is invalidated only when a particular experience raises it without appreciably affecting the criterion behavior that the test is designed to predict.

COACHING. The effects of coaching on test scores have been widely investigated. Many of these studies were conducted by British psychologists, with special reference to the effects of practice and coaching on the tests formerly used in assigning 11-year-old children to different types of secondary schools (Yates et al., 1953-1954). As might be expected, the extent of improvement depends on the ability and earlier educational experiences of the examinees, the nature of the tests, and the amount and type of coaching provided. Individuals with deficient educational backgrounds are more likely to benefit from special coaching than are those who have had superior educational opportunities and are already prepared to do well on the tests. It is obvious, too, that the closer the resemblance between test content and coaching material, the greater will be the improvement in test scores. On the other hand, the more closely instruction is restricted to specific test content, the less likely is improvement to extend to criterion performance.

In America, the College Entrance Examination Board has been concerned about the spread of ill-advised commercial coaching courses for college applicants. To clarify the issues, the College Board conducted several well-controlled experiments to determine the effects of coaching on its Scholastic Aptitude Test and surveyed the results of similar studies by other, independent investigators (Angoff, 1971b; College Entrance Examination Board, 1968). These studies covered a variety of coaching methods and included students in both public and private high schools; one investigation was conducted with black students in 15 urban and rural high schools in Tennessee. The conclusion from all these studies is that intensive drill on items similar to those on the SAT is unlikely to produce appreciably greater gains than occur when students are retested with the SAT after a year of regular high school instruction.

On the basis of such research, the Trustees of the College Board issued a formal statement about coaching, in which the following points were made, among others (College Entrance Examination Board, 1968, pp. 8-9):

The results of the coaching studies which have thus far been completed indicate that average increases of less than 10 points on a 600 point scale can be expected. It is not reasonable to believe that admissions decisions can be affected by such small changes in scores. This is especially true since the tests are merely supplementary to the school record and other evidence taken into account by admissions officers. . . . As the College Board uses the term, aptitude is not something fixed and impervious to influence by the way the child lives and is taught. Rather, this particular Scholastic Aptitude Test is a measure of abilities that seem to grow slowly and stubbornly, profoundly influenced by conditions at home and at school over the years, but not responding to hasty attempts to relive a young lifetime.

It should also be noted that in its test construction procedures, the College Board investigates the susceptibility of new item types to coaching (Angoff, 1971b; Pike & Evans, 1972). Item types on which performance can be appreciably raised by short-term drill or instruction of a narrowly limited nature are not included in the operational forms of the tests.

PRACTICE. The effects of sheer repetition, or practice, on test performance are similar to the effects of coaching, but usually less pronounced. It should be noted that practice, as well as coaching, may alter the nature of the test, since the subjects may employ different work methods in solving the same problems. Moreover, certain types of items may be much easier when encountered a second time. An example is provided by problems requiring insightful solutions which, once attained, can be applied directly in solving the same or similar problems in a retest. Scores on such tests, whether derived from a repetition of the identical test or from a parallel form, should therefore be carefully scrutinized.

A number of studies have been concerned with the effects of the identical repetition of intelligence tests over periods ranging from a few days to several years (see Quereshi, 1968). Both adults and children, and both normal and mentally retarded persons have been employed. The studies have covered individual as well as group tests. All agree in showing significant mean gains on retests. Nor is improvement necessarily limited to the initial repetitions. Whether gains persist or level off in successive administrations seems to depend on the difficulty of the test and the ability level of the subjects. The implications of such findings are illustrated by the results obtained in annual retests of 3,500 schoolchildren with a variety of intelligence tests (Dearborn & Rothney, 1941). When the same test was readministered in successive years, the median IQ of the group rose from 102 to 113, but it dropped to 104 when another test was substituted. Because of the retest gains, the meaning of an IQ obtained on an initial and later trial proved to be quite different. For example, an IQ of 100 fell approximately at the average of the distribution on the initial trial, but in the lowest quarter on a retest. Such IQ's, though numerically identical and derived from the same test, might thus signify normal ability in the one instance and inferior ability in the other.
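The shift in relative standing described above can be checked with a little arithmetic. The sketch below assumes roughly normal IQ distributions with a standard deviation of 15; both the standard deviation and the normality assumption are illustrative additions, not figures reported by Dearborn and Rothney.

    # Percentile rank of an identical IQ of 100 before and after retest gains.
    # Means of 102 and 113 follow the Dearborn & Rothney medians; the SD of 15
    # and the normal-curve assumption are hypothetical, for illustration only.
    from statistics import NormalDist

    SD = 15  # assumed IQ standard deviation

    initial = NormalDist(mu=102, sigma=SD)   # distribution on the initial trial
    retest = NormalDist(mu=113, sigma=SD)    # distribution after retest gains

    for label, dist in [("initial trial", initial), ("retest", retest)]:
        pct = dist.cdf(100) * 100            # percent of the group scoring below 100
        print(f"IQ 100 on the {label}: about the {pct:.0f}th percentile")

Under these assumptions, an IQ of 100 falls near the 45th percentile on the initial trial but only around the 19th percentile on the retest, which agrees with the statement that the same numerical score drops into the lowest quarter of the retest distribution.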
Gains in score are also found on retesting with parallel forms of the same test, although such gains tend in general to be smaller. Significant mean gains have been reported when alternate forms of a test were administered in immediate succession or after intervals ranging from one day to three years (Angoff, 1971b; Droege, 1966; Peel, 1951, 1952). Similar results have been obtained with normal and intellectually gifted schoolchildren, high school and college students, and employee samples. Data on the distribution of gains to be expected on a retest with a parallel form should be provided in test manuals, and allowance for such gains should be made when interpreting test scores.

TEST SOPHISTICATION. The general problem of test sophistication should also be considered in this connection. The individual who has had extensive prior experience in taking psychological tests enjoys a certain advantage in test performance over one who is taking his first test (Heim & Wallace, 1949-1950; Millman, Bishop, & Ebel, 1965; Rodger, 1936). Part of this advantage stems from having overcome an initial feeling of strangeness, as well as from having developed more self-confidence and better test-taking attitudes. Part is the result of a certain amount of overlap in the type of content and functions covered by many tests. Specific familiarity with common item types and practice in the use of objective answer sheets may also improve performance slightly. It is particularly important to take test sophistication into account when comparing the scores obtained by children from different types of schools, where the extent of test-taking experience may have varied widely. Short orientation and practice sessions, as described earlier in this chapter, can be quite effective in equalizing test sophistication (Wahlstrom & Boersma, 1968).

CHAPTER 3

Social and Ethical Implications of Testing

IN ORDER to prevent the misuse of psychological tests, it has become necessary to erect a number of safeguards around both the tests themselves and the test scores. The distribution and use of psychological tests constitutes a major area in Ethical Standards of Psychologists, the code of professional ethics officially adopted by the American Psychological Association and reproduced in Appendix A. Principles 13, 14, and 15 are specifically directed to testing, being concerned with Test Security, Test Interpretation, and Test Publication. Other principles that, although broader in scope, are highly relevant to testing include 6 (Confidentiality), 7 (Client Welfare), and 9 (Impersonal Services). Some of the matters discussed in the Ethical Standards are closely related to points covered in the Standards for Educational and Psychological Tests (1974), cited in Chapter 1. For a fuller and richer understanding of the principles set forth in the Ethical Standards, the reader should consult two companion publications, the Casebook on Ethical Standards of Psychologists (1967) and Ethical Principles in the Conduct of Research with Human Participants (1973). Both report specific incidents to illustrate each principle. Special attention is given to marginal situations in which there may be a conflict of values, as between the advancement of science for human betterment and the protection of the rights and welfare of individuals.

The requirement that tests be used only by appropriately qualified examiners is one step toward protecting the individual against the improper use of tests. Of course, the necessary qualifications vary with the type of test. Thus, a relatively long period of intensive training and supervised experience is required for the proper use of individual intelligence tests and most personality tests, whereas a minimum of specialized psychological training is needed in the case of educational achievement
or vocational proficiency tests. It should also be noted that students who take tests in class for instructional purposes are not usually equipped to administer the tests to others or to interpret the scores properly.

The well-trained examiner chooses tests that are appropriate for both the particular purpose for which he is testing and the person to be examined. He is also cognizant of the available research literature on the chosen test and able to evaluate its technical merits with regard to such characteristics as norms, reliability, and validity. In administering the test, he is sensitive to the many conditions that may affect test performance, such as those illustrated in Chapter 2. He draws conclusions or makes recommendations only after considering the test score (or scores) in the light of other pertinent information about the individual. Above all, he should be sufficiently knowledgeable about the science of human behavior to guard against unwarranted inferences in his interpretations of test scores. When tests are administered by psychological technicians or assistants, or by persons in other professions, it is essential that an adequately qualified psychologist be available, at least as a consultant, to provide the needed perspective for a proper interpretation of test performance.

Misconceptions about the nature and purpose of tests and misinterpretations of test results underlie many of the popular criticisms of psychological tests. In part, these difficulties arise from inadequate communication between psychometricians and their various publics: educators, parents, legislators, job applicants, and so forth. Probably the most common examples center on unfounded inferences from IQs. Not all misconceptions about tests, however, can be attributed to inadequate communication between psychologists and laymen. Psychological testing itself has tended to become dissociated from the mainstream of behavioral science (Anastasi, 1967). The growing complexity of the science of psychology has inevitably been accompanied by increasing specialization among psychologists. In this process, psychometricians have concentrated more and more on the technical refinements of test construction and have tended to lose contact with developments in other relevant specialties, such as learning, child development, individual differences, and behavior genetics. Thus, the technical aspects of test construction have tended to outstrip the psychological sophistication with which test results are interpreted. Test scores can be properly interpreted only in the light of all available knowledge regarding the behavior that the tests are designed to measure.

Who is a qualified psychologist? Obviously, with the diversification of the field and the consequent specialization of training, no psychologist is equally qualified in all areas. In recognition of this fact, the Ethical Standards specify: "The psychologist recognizes the boundaries of his competence and the limitations of his techniques and does not offer services or use techniques that fail to meet professional standards established in particular fields" (Appendix A, Principle 2c). A useful distinction is that between a psychologist working in an institutional setting, such as a school system, university, clinic, or government agency, and one engaged in independent practice. Because the independent practitioner is less subject to judgment and evaluation by knowledgeable colleagues than is the institutional psychologist, he needs to meet higher standards of professional qualifications. The same would be true of a psychologist responsible for the supervision of other institutional psychologists or one who serves as an expert consultant to institutional personnel.

A significant step, both in upgrading professional standards and in helping the public to identify qualified psychologists, was the enactment of state licensing and certification laws for psychologists. Nearly all states now have such laws. Although the terms "licensing" and "certification" are often used interchangeably, in psychology certification typically refers to legal protection of the title "psychologist," whereas licensing controls the practice of psychology. Licensing laws thus need to include a definition of the practice of psychology. In either type of law, the requirements are generally a PhD in psychology, a specified amount of supervised experience, and satisfactory performance on a qualifying examination. Violations of the APA ethics code constitute grounds for revoking a certificate or license. Although most states began with the simpler certification laws, there has been continuing movement toward licensing.

At a more advanced level, specialty certification within psychology is provided by the American Board of Professional Psychology (ABPP). Requiring a high level of training and experience within designated specialties, ABPP grants diplomas in such areas as clinical, counseling, industrial and organizational, and school psychology. The Biographical Directory of the APA contains a list of current diplomates in each specialty, which can also be obtained directly from ABPP. The principal function of ABPP is to provide information regarding qualified psychologists. As a privately constituted board within the profession, ABPP does not have the enforcement authority available to the agencies administering the state licensing and certification laws.

The purchase of tests is generally restricted to persons who meet certain minimal qualifications. The catalogues of major test publishers specify requirements that must be met by purchasers. Usually individuals with a master's degree in psychology or its equivalent qualify. Some publishers classify their tests into levels with reference to user qualifications, ranging from educational achievement and vocational proficiency tests, through
intelligence tests and interest inventories, to such clinical instruments as individual intelligence tests and projective personality tests. Distinctions may also be made between individual purchasers and authorized institutional purchasers. Graduate students who may need a particular test for a class assignment or for research must have the order countersigned by their psychology instructor, who assumes responsibility for the proper use of the test.

Efforts to restrict the distribution of tests have a dual objective: security of test materials and prevention of misuse. The Ethical Standards state: "Access to such devices is limited to persons with professional interests who will safeguard their use" (Principle 13); "Test scores, like test materials, are released only to persons who are qualified to interpret and use them properly" (Principle 14). It should be noted that although test distributors make sincere efforts to implement these objectives, the controls they are able to exert are necessarily limited. The major responsibility for the proper use of tests resides in the individual user or institution concerned. It is evident, for example, that an MA degree in psychology, or even a PhD, state license, and ABPP diploma, do not necessarily signify that the individual is qualified to use a particular test or that his training is relevant to the proper interpretation of the results obtained with that test.

Another professional responsibility concerns the marketing of psychological tests by authors and publishers. Tests should not be released prematurely for general use. Nor should any claims be made regarding the merits of a test in the absence of sufficient objective evidence. When a test is distributed early for research purposes only, this condition should be clearly specified and the distribution of the test restricted accordingly. The test manual should provide adequate data to permit an evaluation of the test itself as well as full information regarding administration, scoring, and norms. The manual should be a factual exposition of what is known about the test rather than a selling device designed to put the test in a favorable light. It is the responsibility of the test author and publisher to revise tests and norms often enough to prevent obsolescence. The rapidity with which a test becomes outdated will, of course, vary with the nature of the test.

Tests should not be published in a newspaper, magazine, or popular book, either for descriptive purposes or for self-evaluation. Under these conditions, self-evaluation would not only be subject to such drastic errors as to be worthless, but it might also be psychologically injurious to the individual. Moreover, any publicity given to specific test items will tend to invalidate the future use of the test with other persons. It might also be added that presentation of test materials in this fashion tends to create an erroneous and distorted picture of testing in general. Such publicity may foster either naive credulity or indiscriminate resistance on the part of the public toward all psychological testing.

Another unprofessional practice is testing by mail. An individual's performance on either aptitude or personality tests cannot be properly assessed by mailing test forms to him and having him return them by mail for scoring and interpretation. Not only does this procedure provide no control of testing conditions, but usually it also involves the interpretation of test scores in the absence of other pertinent information about the individual. Under these conditions, test results may be worse than useless.

PROTECTION OF PRIVACY

A question arising particularly in connection with personality tests is that of invasion of privacy. Insofar as some tests of emotional, motivational, or attitudinal traits are necessarily disguised, the subject may reveal characteristics in the course of such a test without realizing that he is so doing. Although there are few available tests whose approach is subtle enough to fall into this category, the possibility of developing such indirect testing procedures imposes a grave responsibility on the psychologist who uses them. For purposes of testing effectiveness, it may be necessary to keep the examinee in ignorance of the specific ways in which his responses on any one test are to be interpreted. Nevertheless, a person should not be subjected to any testing program under false pretenses. Of primary importance in this connection is the obligation to have a clear understanding with the examinee regarding the use that will be made of his test results. The following statement contained in Ethical Standards of Psychologists (Principle 7d) is especially germane to this problem:

The psychologist who asks that an individual reveal personal information in the course of interviewing, testing, or evaluation, or who allows such information to be divulged to him, does so only after making certain that the responsible person is fully aware of the purposes of the interview, testing, or evaluation and of the ways in which the information may be used.

Although concerns about the invasion of privacy have been expressed most commonly about personality tests, they logically apply to any type of test. Certainly any intelligence, aptitude, or achievement test may reveal limitations in skills and knowledge that an individual would rather not disclose. Moreover, any observation of an individual's behavior, as in an interview, casual conversation, or other personal encounter, may yield information about him that he would prefer to conceal and that he may reveal unwittingly. The fact that psychological tests have often been
singled out in discussions of the invasion of privacy probably reflects prevalent misconceptions about tests. If all tests were recognized as measures of behavior samples, with no mysterious powers to penetrate beyond behavior, popular fears and suspicion would be lessened.

It should also be noted that all behavior research, whether employing tests or other observational procedures, presents the possibility of invasion of privacy. Yet, as scientists, psychologists are committed to the goal of advancing knowledge about human behavior. Principle 1a in Ethical Standards of Psychologists (Appendix A) clearly spells out the psychologist's conviction "that society will be best served when he investigates where his judgment indicates investigation is needed." Several other principles, on the other hand, are concerned with the protection of privacy and with the welfare of research subjects (see, e.g., 7d, 8a, 16). Conflicts of values may thus arise, which must be resolved in individual cases. Examples of such conflict resolutions can be found in the previously cited Ethical Principles in the Conduct of Research with Human Participants (1973).

The problem is obviously not simple; and it has been the subject of extensive deliberation by psychologists and other professionals. In a report entitled Privacy and Behavioral Research (1967), prepared for the Office of Science and Technology, the right to privacy is defined as "the right of the individual to decide for himself how much he will share with others his thoughts, his feelings, and the facts of his personal life" (p. 2). It is further characterized as "a right that is essential to insure dignity and freedom of self-determination" (p. 2). To safeguard personal privacy, no universal rules can be formulated; only general guidelines can be provided. In the application of these guidelines to specific cases, there is no substitute for the ethical awareness and professional responsibility of the individual psychologist. Solutions must be worked out in terms of the particular circumstances.

One relevant factor is the purpose for which the testing is conducted, whether for individual counseling, institutional decisions regarding selection and classification, or research. In clinical or counseling situations, the client is usually willing to reveal himself in order to obtain help with his problems. The clinician or examiner does not invade privacy where he is freely admitted. Even under these conditions, however, the client should be warned that in the course of the testing or interviewing he may reveal information about himself without realizing that he is so doing; or he may disclose feelings of which he himself is unaware.

When testing is conducted for institutional purposes, the examinee should be fully informed as to the use that will be made of his test scores. It is also desirable, however, to explain to the examinee that correct assessment will benefit him, since it is not to his advantage to be placed in a position where he will fail or which he will find uncongenial. The results of tests administered in a clinical or counseling situation, of course, should not be made available for institutional purposes, unless the examinee gives his consent.

When tests are given for research purposes, anonymity should be preserved as fully as possible and the procedures for ensuring such anonymity should be explained in advance to the subjects. Anonymity does not, however, solve the problem of protecting privacy in all research contexts. Some subjects may resent the disclosure of facts they consider personal, even when complete confidentiality of responses is assured. In most cases, however, cooperation of subjects may be elicited if they are convinced that the information is needed for the research in question and if they have sufficient confidence in the integrity and competence of the investigator. All research on human behavior, whether or not it utilizes tests, may present conflicts of values. Freedom of inquiry, which is essential to the progress of science, must be balanced against the protection of the individual. The investigator must be alert to the values involved and must carefully weigh alternative solutions (see Ethical Principles, 1973; Privacy and Behavioral Research, 1967; Ruebhausen & Brim, 1966).

Whatever the purposes of testing, the protection of privacy involves two key concepts: relevance and informed consent. The information that the individual is asked to reveal must be relevant to the stated purposes of the testing. An important implication of this principle is that all practicable efforts should be made to ascertain the validity of tests for the particular diagnostic or predictive purpose for which they are used. An instrument that is demonstrably valid for a given purpose is one that provides relevant information. It also behooves the examiner to make sure that test scores are correctly interpreted. An individual is less likely to feel that his privacy is being invaded by a test assessing his readiness for a particular educational program than by a test allegedly measuring his "innate intelligence."

The concept of informed consent also requires clarification; and its application in individual cases may call for the exercise of considerable judgment (Ethical Principles, 1973; Ruebhausen & Brim, 1966). The examinee should certainly be informed about the purpose of testing, the kinds of data sought, and the use that will be made of his scores. It is not implied, however, that he be shown the test items in advance or told how specific responses will be scored. Nor should the test items be shown to a parent, in the case of a minor. Such information would usually invalidate the test. Not only would the giving of this information seriously impair the usefulness of an ability test, but it would also tend to distort responses on many personality tests. For example, if an individual is told in advance that a self-report inventory will be scored with a dominance
scale, his responses are likely to be influenced by stereotyped (and often erroneous) ideas he may have about this trait or by a false or distorted image of himself.

In the testing of children, special questions arise with regard to parental consent. Relevant recommendations are contained in the Russell Sage Foundation Guidelines for the Collection, Maintenance, and Dissemination of Pupil Records (1970). With reference to consent, the Guidelines differentiate between individual consent, given by the child, his parents, or both, and representational consent, given by the parents' legally elected or appointed representatives, such as a school board. While avoiding rigid prescriptions, the Guidelines cite aptitude and achievement tests as examples of the type of instrument for which representational consent should be sufficient, whereas individual consent is recommended for personality assessment. A helpful feature of the Guidelines is the inclusion of sample forms for obtaining written consent. There is also a selected bibliography on the ethical and legal aspects of school record keeping.

Research procedures and experimental designs that protect the individual's right to decline to participate and that adequately safeguard his privacy, while yielding scientifically meaningful data, present a challenge to the psychologist's ingenuity. With proper rapport and the establishment of attitudes of mutual respect, however, the number of refusals to participate may be reduced to a negligible quantity. The technical difficulties of biased sampling and volunteer error may thus be avoided. Findings from both national and statewide surveys suggest that this goal can be achieved, both in the assessment of educational outcomes and in the more sensitive area of personality research (Holtzman, 1971; Womer, 1971). There is also some evidence that the number of respondents who consider some of the items on a personality inventory offensive or an invasion of privacy is significantly reduced when the inventory is preceded by a simple and forthright explanation of how items were selected and how scores will be interpreted (Fink & Butcher, 1972). From the standpoint of test validity, it should be added that such an explanation did not affect the mean profile of scores on the personality inventory.

CONFIDENTIALITY

Like the protection of privacy, to which it is related, the problem of the confidentiality of test data is multifaceted. The fundamental question is: Who shall have access to test results? Several considerations influence the answer in particular situations. Among them are the security of test content, the hazards of misunderstanding test scores, and the need of various persons to know the results.

There has been a growing awareness of the right of the individual himself to have access to the findings in his test report. He should also have the opportunity to comment on the contents of the report and if necessary to clarify or correct factual information. Counselors are now trying more and more to involve the client as an active participant in his own assessment. For these purposes, test results should be presented in a form that is readily understandable, free from technical jargon or labels, and oriented toward the immediate objective of the testing. Proper safeguards must be observed against misuse and misinterpretation of test findings (see Ethical Standards, Principle 14).

In the case of minors, one must also consider the parents' right of access to the child's test record. This presents a possible conflict with the child's own right to privacy, especially in the case of older children. In a searching analysis of the problem, Ruebhausen and Brim (1966, pp. 431-432) wrote: "Should not a child, even before the age of full legal responsibility, be accorded the dignity of a private personality? Considerations of healthy personal growth, buttressed with reasons of ethics, seem to command that this be done." The previously mentioned Guidelines (Russell Sage Foundation, 1970, p. 27) recommend that "when a student reaches the age of eighteen and no longer is attending high school, or is married (whether age eighteen or not)," he should have the right to deny parental access to his records. However, this recommendation is followed by the caution that school authorities check local state laws for possible legal difficulties in implementing such a policy.

Apart from these possible exceptions, the question is not whether to communicate test results to parents of a minor but how to do so. Parents normally have a legal right to information about their child; and it is usually desirable for them to have such information. In some cases, moreover, a child's academic or emotional difficulties may arise in part from parent-child relations. Under these conditions, the counselor's contact with the parents is of prime importance, both to fill in background data and to elicit parental cooperation.

Discussions of the confidentiality of test records have usually dealt with accessibility to a third person, other than the individual tested (or parent of a minor) and the examiner (Ethical Standards, Principle 6; Russell Sage Foundation, 1970). The underlying principle is that such records should not be released without the knowledge and consent of the individual.

When tests are administered in an institutional context, as in a school system, court, or employment setting, the individual should be informed at the time of testing regarding the purpose of the test, how the results
will be used, and their availability to institutional personnel who have a legitimate need for them. Under these conditions, no further permission is required when the results are made available within the institution.

A different situation exists when test results are requested by outsiders, as when a prospective employer or a college requests test results from a school system. In these instances, individual consent for release of the data is required. The same requirement applies to test results obtained in clinical and counseling contexts, or for research purposes. The previously cited Guidelines (Russell Sage Foundation, 1970, p. 42) contain a sample consent form for the use of school systems in clearing the transmission of such data.

Another problem pertains to the retention of records in institutions. On the one hand, longitudinal records can be very valuable, not only for research purposes but also for understanding and counseling the person. As is so often the case, these advantages presuppose proper use and interpretation of test results. On the other hand, the availability of old records opens the way for such misuses as incorrect inferences from obsolete data and unauthorized access for other than the original testing purpose. It would be manifestly absurd, for example, to cite an IQ or a reading achievement score obtained by a child in the third grade when evaluating him for admission to college. Too much may have happened to him in the intervening years to make such early and outdated scores meaningful. Similarly, when records are retained for many years, there is danger that they may be used for purposes that the individual (or his parents) never suspected and would not have approved.

To prevent such misuses, when records are retained either for legitimate longitudinal use in the interest of the individual or for acceptable research purposes, access to them should be subject to unusually stringent controls. In the schedule included in the Guidelines (Russell Sage Foundation, 1970, p. 42), school records are classified into three categories with regard to their retention. A major determining factor in this classification is the degree of objectivity and verifiability of the data; another is relevance to the educational objectives of the school. It would be well for any type of institution to formulate similar explicit policies regarding the destruction, retention, and accessibility of personal records.

The problems of maintenance, security, and accessibility of test results and of all other personal data have been magnified by the development of computerized data banks. In his preface to the Guidelines (Russell Sage Foundation, 1970, pp. 5-6), Ruebhausen wrote:

Modern science has introduced a new dimension into the issues of privacy. There was a time when among the strongest allies of privacy were the inefficiency of man, the fallibility of his memory, and the healing compassion that accompanied both the passing of time and the warmth of human recollection. Modern science has given us the capacity to record faithfully, to maintain permanently, to retrieve promptly, and to communicate both widely and instantly.

The unprecedented advances in storing, processing, and retrieving data made possible by computers can be of inestimable service both in research and in the more immediate handling of social problems. The potential dangers of invasion of privacy and violation of confidentiality need to be faced squarely, constructively, and imaginatively. Rather than fearing the centralization and efficiency of complex computer systems, we should explore the possibility that these very characteristics may permit more effective procedures for protecting the security of individual records.

An example of what can be accomplished with adequate facilities is provided by the Link system developed by the American Council on Education (Astin & Boruch, 1970). In a longitudinal research program on the effects of different types of college environments, questionnaires were administered annually to several hundred thousand college freshmen. To permit the collection of follow-up data on the same persons while preventing the identification of individual responses by anyone at any future time, a three-file system of computer tapes was devised. The first tape, containing each student's responses marked with an arbitrary identification number, is readily accessible for research purposes. The second tape, containing only the students' names and addresses with the same identification numbers, was originally housed in a locked vault and used only to print labels for follow-up mailings. After the preparation of these tapes, the original questionnaires were destroyed.

This two-file system represents the traditional security system. It still did not provide complete protection, since some staff members would have access to both files. Moreover, such files are subject to judicial and legislative subpoena. For these reasons, a third file was prepared. Known as the Link file, it contained only the original identification numbers and a new set of random numbers which were substituted for the original identification numbers in the name and address file. The Link file was deposited at a computer facility in a foreign country, with the agreement that the file would never be released to anyone, including the American Council on Education. Follow-up data tapes are sent to the foreign facility, which substitutes one set of code numbers for the other. With the decoding files and the research data files under the control of different organizations, no one can identify the responses of individuals in the data files. Such elaborate precautions for the protection of confidentiality obviously would not be feasible except in a large-scale computerized data bank. The procedure could be simplified somewhat if the linking facility were located in a domestic agency given adequate protection against subpoena.
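The logic of the three-file arrangement can be sketched in a few lines of code. Everything below, including the dictionary layouts, the sample records, and the recode_followup helper, is an illustrative reconstruction of the scheme Astin and Boruch describe, not the actual ACE implementation.

    # Illustrative sketch of the ACE "Link" three-file scheme; all data invented.
    import random

    # File 1 (research org): responses keyed by arbitrary identification numbers.
    research_file = {101: "responses of student A", 102: "responses of student B"}

    # Link file (held by a separate facility): arbitrary id -> new random code.
    codes = random.sample(range(10_000, 100_000), k=len(research_file))
    link_file = dict(zip(research_file, codes))

    # File 2 (research org): names and addresses keyed ONLY by the random codes,
    # the original identification numbers having been replaced.
    name_file = {link_file[101]: "Student A, 1 Elm St.",
                 link_file[102]: "Student B, 2 Oak Ave."}

    def recode_followup(followup_by_code):
        """What the linking facility does: swap the random codes back to the
        arbitrary ids so follow-up data can be merged with File 1, without
        ever seeing names or responses."""
        reverse = {code: aid for aid, code in link_file.items()}
        return {reverse[code]: data for code, data in followup_by_code.items()}

    followup = {code: "follow-up answers" for code in name_file}
    print(recode_followup(followup))   # keyed by 101, 102: mergeable with File 1

The design property is visible in the sketch: the organization holding the response file and the name file lacks the link table needed to join them, while the organization holding the link table never sees a name or a response.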
Social and "Etl1icalIIll1"ications of Testing 57

COMMUNICATING TEST RESULTS

Psychologists have given much thought to the communication of test results in a form that will be meaningful and useful. It is clear that the results should not be transmitted routinely, but should be accompanied by interpretive explanations by a professionally trained person.

In communicating scores to parents, for example, a recommended procedure is to arrange a group meeting at which a counselor or school psychologist explains the purpose and nature of the tests, the sort of conclusions that may reasonably be drawn from the results, and the limitations of the data. Written reports about their own children may then be distributed to the parents, and arrangements made for personal interviews with any parents wishing to discuss the reports further. Regardless of how they are transmitted, however, an important condition is that test results should be presented in terms of descriptive performance levels rather than isolated numerical scores. This is especially true of intelligence tests, which are more likely to be misinterpreted than are other tests.

In communicating results to teachers, school administrators, employers, and other appropriate persons, similar safeguards should be provided. Levels of performance and qualitative descriptions in simple terms are to be preferred over specific numerical scores, except when communicating with adequately trained professionals. Even well-educated laymen have been known to confuse percentiles with percentage scores, percentiles with IQ's, norms with standards, and interest ratings with aptitude scores.
with lQ's, norms with standards, and int~Fts~ ratlOgs With
'ores.But a.more serious misinte )fetation )ertams to the con-
rawn from test SCOl'es,even w en their te.c:nnical meaning is
mderstood. A familiar example is the popuhyassumption that
!cates a fixed characteristic of the individual wl)ich pTede-
T~ SETfINC. ~he decades since 1950 have witnessed an increasing
is lifetime level of intellectual achievemen~. , ,- publIc concern With the rights of minorities,' a concern that is reflected in
litcommunication it is desirable to take .i.W:oaccount the char- .
the enactment of civil rights legislation at both federal and state levels.
of the person who is to receive the i~fomlation. This. applies I
In conn~t~on with mechanisms for improving educational and vocational
o at person's general educatIOn 1~:imowledge about psy-
opportumhes ~f such groups, psychological testing has been a major
nd testing. but also to his anticipated eIllotional response to the
focus of att:nbon. Th~ psychological literature of the 1960s and early
on. In the case of a parent or teacher, for. example, personal
197?s co~tams many dI~cussions of the topic, whose impact ran.ges from
I' involvement with the child may interfere with a calm and
clanflcabo~ ~o obfuscation. Among the more clarifying contributions are
'cceptance of factual information. . . several po.slbon papers by professional associ,tit>ns (see, e.g., American
ut by no means least is the problem of commumcatlOg test re-
Psychological Association, 1969; Cleary, Humphreys, Kendrick, & \Ves-
';e individual himself, whether child or adult. The same gene.ral
.'s against misinterpretation apply here as in ~mmuni~tm~ Ie tlthou~h ~omen repre)'lnt a statistical majorltyjn the nati~~al population.
I

ird party. The person's emotional reaction to the mforrnatlOn lS ga I.y,~c~upalJonallY'in in otlu~r ways.they have s~ed Jllany of the problems
of mmoTlhes. Hence w the term "minority" is use(i "fu tnis section it will be
ly important. of course, when he is learnin? about hfs 0'1\'11 assets understood to includj) men. '
.,... :.. ~.. 'H1".~ ,,,, ;nr1;vir'll1:1l
is !!iven hiS own test results, not
A brief but cogent paper by Flaugher also helps to clear away some prevalent sources of confusion. Much of the concern centers on the lowering of test scores by cultural conditions that may have affected the development of abilities, motivation, attitudes, and other psychological characteristics of minority group members. Some of the proposed solutions for the problem reveal misunderstandings about the nature and function of psychological tests. Differences in the experiential backgrounds of groups or individuals are inevitably manifested in test performance. Every psychological test measures a behavior sample. Insofar as culture affects behavior, its influence will and should be detected by tests. If we rule out all cultural differentials from a test, we may thereby lower its validity as a measure of the behavior domain it was designed to assess. In that case, the test would fail to provide the kind of information needed to correct the very conditions that impaired performance.

Because the testing of minorities represents a special case of the broader problem of cross-cultural testing, the rationale and testing procedures are treated more fully in Chapter 12. A technical analysis of the concept of test bias is given in Chapter 7, in connection with test validity. In the present chapter, our interest is primarily in the basic issues and social implications of minority group testing.

TEST-RELATED FACTORS. In testing culturally diverse persons, it is important to differentiate between cultural factors that affect both test and criterion behavior and those whose influence is restricted to the test. It is the latter, test-related factors, that reduce test validity. Examples of such factors include previous experience in taking tests, motivation to perform well on tests, rapport with the examiner, and any other variables affecting performance on the particular test but irrelevant to the criterion behavior under consideration. Special efforts should be made to reduce the operation of these test-related factors when testing persons with dissimilar cultural backgrounds. A desirable procedure is to ensure adequate test-taking orientation and preliminary practice, as illustrated by the booklets and tape recordings cited in Chapter 2. Retesting with a parallel form is also recommended with low-scoring examinees who have had little or no prior testing experience.

Specific test content may also influence test scores in ways that are unrelated to criterion performance. For example, the use of names or pictures of objects unfamiliar in a particular cultural milieu would obviously represent a test-restricted handicap. Ability to carry out quantitative thinking does not depend upon familiarity with such objects. On the other hand, if the development of arithmetic ability itself is more strongly fostered in one culture than in another, scores on an arithmetic test should not eliminate or conceal such a difference.

Another, more subtle way in which specific test content may spuriously affect performance is through the examinee's emotional and attitudinal responses. Stories or pictures portraying typical suburban middle-class family scenes, for example, may alienate a child reared in a low-income inner-city home. Exclusive representation of the physical features of a single racial type in test illustrations may have a similar effect on members of an ethnic minority. In the same vein, women's organizations have objected to the perpetuation of sex stereotypes in test content, as in the portrayal of male doctors or executives and female nurses or secretaries. Certain words, too, may have acquired connotations that are offensive to minority groups. As one test publisher aptly expressed it, "Until fairly recently, most standardized tests were constructed by white middle-class people, who sometimes clumsily violate the feelings of the test-taker without even knowing it. In a way, one could say that we have been not so much culture biased as we have been 'culture blind'" (Fitzgibbon, 1972, pp. 2-3).

The major test publishers now make special efforts to weed out inappropriate test content. Their own test construction staffs have become sensitized to potentially offensive, culturally restricted, or stereotyped material. Members of different ethnic groups participate either as regular staff members or as consultants. And the reviewing of test content with reference to possible minority implications is a regular step in the process of test construction. An example of the application of these procedures in item construction and revision is provided by the 1970 edition of the Metropolitan Achievement Tests (Fitzgibbon, 1972; Harcourt Brace Jovanovich, 1972).

INTERPRETATION AND USE OF TEST SCORES. By far the most important considerations in the testing of culturally diverse groups, as in all testing, pertain to the interpretation of test scores. The most frequent misgivings regarding the use of tests with minority group members stem from misinterpretations of scores. If a minority examinee obtains a low score on an aptitude test or a deviant score on a personality test, it is essential to investigate why he did so. For example, an inferior score on an arithmetic test could result from low test-taking motivation, poor reading ability, or inadequate knowledge of arithmetic, among other reasons.

Some thought should also be given to the type of norms to be employed in evaluating individual scores. Depending on the purpose of the testing, the appropriate norms may be general norms or subgroup norms based on
. . . an IQ would thus serve to perpetuate their handicap. It is largely because implications of permanent status have become attached to the IQ that in 1964 the use of group intelligence tests was discontinued in the New York City public schools (H. B. Gilbert, 1966; Loretan, 1966). That it proved necessary to discard the tests in order to eliminate the misconceptions about the fixity of the IQ is a revealing commentary on the tenacity of the misconceptions. It should also be noted that the use of individual intelligence tests like the Stanford-Binet, which are administered and interpreted by trained examiners and school psychologists, was not eliminated. It was the mass testing and routine use of IQ's by relatively unsophisticated persons that was considered hazardous.

According to a popular misconception, the IQ is an index of innate intellectual potential and represents a fixed property of the organism. As will be seen in Chapter 12, this view is neither theoretically defensible nor supported by empirical data. When properly interpreted, intelligence test scores should not foster a rigid categorizing of persons. On the contrary, intelligence tests (and any other tests) may be regarded as a map on which the individual's present position can be located. When combined with information about his experiential background, test scores should facilitate effective planning for the optimal development of the individual.

OBJECTIVITY OF TESTS. When social stereotypes and prejudice may distort interpersonal evaluations, tests provide a safeguard against favoritism and arbitrary or capricious decisions. Commenting on the use of tests in schools, Gardner (1961, pp. 48-49) wrote: "The tests couldn't see whether the youngster was in rags or in tweeds, and they couldn't hear the accents of the slum. The tests revealed intellectual gifts at every level of the population."

In the same vein, the Guidelines for Testing Minority Group Children (Deutsch et al., 1964, p. 139) contain the following observation:

Many bright, non-conforming pupils, with backgrounds different from those of their teachers, make favorable showings on achievement tests, in contrast to their low classroom marks. These are very often children whose cultural handicaps are most evident in their overt social and interpersonal behavior. Without the intervention of standardized tests, many such children would be stigmatized by the adverse subjective ratings of teachers who tend to reward conformist behavior of middle-class character.

With regard to personnel selection, the contribution of tests was aptly characterized in the following words by John W. Macy, Jr., Chairman of the United States Civil Service Commission (Testing and Public Policy, 1965, p. 883):
The necessity to measure characteristics of people that are related to job performance is at the very root of the merit system, which is the basis for entry into the career services of the Federal Government. Over the years, the service has had a vital interest in the development and application of psychological testing methods. I have no doubt that the widespread public confidence in the objectivity of our hiring procedures has in large part been . . . by the public's perception of the fairness, the practicality, and the . . . of the appraisal methods they must submit to.

The Guidelines on Employee Selection Procedures, prepared by the Equal Employment Opportunity Commission (1970) as an aid in the implementation of the Civil Rights Act, begin with the following statement of purpose:

The guidelines in this part are based on the belief that properly validated and standardized employee selection procedures can significantly contribute to the implementation of nondiscriminatory personnel policies, as required by Title VII. It is also recognized that professionally developed tests, when used in conjunction with other tools of personnel assessment and complemented by sound programs of job design, may significantly aid in the development and maintenance of an efficient work force and, indeed, aid in the utilization and conservation of human resources generally.

In summary, tests can be misused in testing culturally disadvantaged persons, as in testing anyone else. When properly used, however, they serve an important function in preventing irrelevant and unfair discrimination. They also provide a quantitative index of the extent of cultural handicap as a necessary first step in remedial programs.

LEGAL REGULATIONS. Prior to the development of such legal mechanisms at the federal level, a number of states enacted legislation and established Fair Employment Practices Commissions (FEPC) to implement it. Among the states that did so later, some efforts have been made to pattern the regulations after the federal model.1 The most pertinent federal legislation is provided by the Equal Employment Opportunity Act (Title VII of the Civil Rights Act of 1964 and its subsequent amendments). Responsibility for implementation and enforcement is vested in the Equal Employment Opportunity Commission (EEOC). When charges are filed, the EEOC investigates the complaint and, if it finds the charges to be justified, tries first to correct the situation through conferences and voluntary compliance. If these procedures fail, EEOC may proceed to hold hearings, issue cease and desist orders, and finally bring action in the federal courts. In states having an approved FEPC, the Commission will defer to the local agency and will give its findings and conclusions "substantial weight."

The Office of Federal Contract Compliance (OFCC) has the authority to monitor the use of tests for employment purposes by government contractors. Colleges and universities are among the institutions concerned with OFCC regulations, because of their many research and training grants from such federal sources as the Department of Health, Education, and Welfare. Both EEOC and OFCC have drawn up guidelines regarding employee testing and other selection procedures, which are virtually identical in substance. A copy of the EEOC Guidelines on Employee Selection Procedures is reproduced in Appendix B, together with a 1974 amendment of the OFCC guidelines clarifying acceptable procedures for reporting test validity.3

Some major provisions in the EEOC Guidelines should be noted. The Equal Employment Opportunity Act prohibits discrimination by employers, trade unions, or employment agencies on the basis of race, color, religion, sex, or national origin. It is recognized that properly conducted testing programs not only are acceptable under this Act but can also contribute to the "implementation of nondiscriminatory personnel policies." Moreover, the same regulations specified for tests are also applied to all other formal and informal selection procedures, such as educational or work-history requirements, interviews, and application forms (Sections 2 and 13).

When the use of a test (or other selection procedure) results in a significantly higher rejection rate for minority candidates than for nonminority candidates, its utility must be justified by evidence of validity for the job in question. In defining acceptable procedures for establishing validity, the Guidelines make explicit reference to the Standards for Educational and Psychological Tests (1974) prepared by the American Psychological Association. A major portion of the Guidelines covers minimum requirements for acceptable validation (Sections 5 to 9). The reader may find it profitable to review these requirements after reading the more detailed technical discussion of validity in Chapters 6 and 7 of this book. It will be seen that the requirements are generally in line with good psychometric practice.

In the final section, dealing with affirmative action, the Guidelines point out that even when selection procedures have been satisfactorily

1 A brief summary of the major legal developments since midcentury, including legislative actions, executive orders, and court decisions, can be found in Fincher (1973).

3 In 1973, in the interest of simplification and improved coordination, the preparation of a set of uniform guidelines was undertaken by the Equal Employment Opportunity Coordinating Council, consisting of representatives of EEOC, the U.S. Department of Justice, the U.S. Civil Service Commission, the U.S. Department of Labor, and the U.S. Commission on Civil Rights. No uniform version of the guidelines has yet been adopted.
validated, if disproportionate rejection rates result for minorities, steps must be taken to reduce this discrepancy as much as possible. Affirmative action implies that an organization does more than merely avoid discriminatory practices. Psychologically, affirmative action programs may be regarded as efforts to compensate for the residual effects of past social inequities. Such effects may include deficiencies in aptitudes, job skills, motivation, and other job-related behavior. They may also be reflected in a person's reluctance to apply for a job not traditionally open to minority candidates, or in his inexperience in job-seeking procedures. Affirmative actions in meeting these problems include recruiting through media most likely to reach minorities; explicitly encouraging minority candidates to apply and following other recruiting practices designed to counteract past stereotypes; and, when practicable, instituting special training programs for the acquisition of prerequisite knowledge.

PART 2

Principles of Psychological Testing

CHAPTER 4

Norms and the Interpretation of Test Scores

IN THE absence of additional interpretive data, a raw score on any psychological test is meaningless. To say that an individual has correctly solved 15 problems on an arithmetic reasoning test, or identified 34 words in a vocabulary test, or successfully assembled a mechanical object in 57 seconds conveys little or no information about his standing in any of these functions. Nor do the familiar percentage scores provide a satisfactory solution to the problem of interpreting test scores. A score of 65 percent correct on one vocabulary test, for example, might be equivalent to 30 percent correct on another, and to 80 percent correct on a third. The difficulty level of the items making up each test will, of course, determine the meaning of the score. Like all raw scores, percentage scores can be interpreted only in terms of a clearly defined and uniform frame of reference.

Scores on psychological tests are most commonly interpreted by reference to norms, which represent the test performance of the standardization sample. The norms are thus empirically established by determining what a representative group of persons actually do on the test. Any individual's raw score is then referred to the distribution of scores obtained by the standardization sample, to discover where he falls in that distribution. Does his score coincide with the average performance of the standardization group? Is he slightly below average? Or does he fall near the upper end of the distribution?

In order to determine more precisely the individual's exact position with reference to the standardization sample, the raw score is converted into some relative measure. These derived scores are designed to serve a dual purpose. First, they indicate the individual's relative standing in the normative sample and thus permit an evaluation of his performance in reference to other persons. Second, they provide comparable measures that permit a direct comparison of the individual's performance on different tests. For example, if an individual has a raw score of 40 on a vocabulary test and a raw score of 22 on an arithmetic reasoning test, we
know nothing about his relative performance on the two tests. Is he better in vocabulary or in arithmetic, or equally good in both? Since raw scores on different tests are usually expressed in different units, a direct comparison of such scores is impossible. The difficulty level of the particular test would also affect such a comparison between raw scores. Derived scores, on the other hand, can be expressed in the same units and referred to the same or to closely similar normative samples for different tests. The individual's relative performance in many different functions can thus be compared.

There are various ways in which raw scores may be converted to fulfill the two objectives stated above. Fundamentally, however, derived scores are expressed in one of two major ways: (1) developmental level attained or (2) relative position within a specified group. These types of scores, together with some of their common variants, will be considered in the following sections of this chapter. But first it will be necessary to examine certain elementary statistical concepts that underlie the development and utilization of norms. The following section is included simply to clarify the meaning of certain common statistical measures. Simplified computational examples are given only for this purpose and not to provide training in statistical methods. For computational details and specific procedures to be followed in the practical application of these techniques, the reader is referred to any recent textbook on psychological or educational statistics.
STATISTICAL CONCEPTS

A major object of statistical method is to organize and summarize quantitative data in order to facilitate their understanding. A list of 1,000 test scores can be an overwhelming sight. In that form, it conveys little meaning. A first step in bringing order into such a chaos of raw data is to tabulate the scores into a frequency distribution, as illustrated in Table 1. Such a distribution is prepared by grouping the scores into convenient class intervals and tallying each score in the appropriate interval. When all scores have been entered, the tallies are counted to find the frequency, or number of cases, in each class interval. The sums of these frequencies will equal N, the total number of cases in the group. Table 1 shows the scores of 1,000 college students on a code-learning test in which one set of artificial words, or nonsense syllables, was to be substituted for another. The scores, giving the number of correct syllables substituted in a two-minute trial, ranged from 8 to 52. They have been grouped into class intervals of 4 points, from 52-55 at the top of the distribution down to 8-11 at the bottom. The frequency column reveals that two persons scored between 8 and 11, three between 12 and 15, eight between 16 and 19, and so on.

TABLE 1
Frequency Distribution of Scores of 1,000 College Students
on a Code-Learning Test
(Data from Anastasi, 1934, p. 34)

Class Interval     Frequency
    52-55                1
    48-51                1
    44-47               20
    40-43               73
    36-39              156
    32-35              328
    28-31              244
    24-27              136
    20-23               28
    16-19                8
    12-15                3
     8-11                2
               N = 1,000
The information provided by a frequency distribution can also be presented graphically in the form of a distribution curve. Figure 1 shows the data of Table 1 in graphic form. On the baseline, or horizontal axis, are the scores grouped into class intervals; on the vertical axis are the frequencies, or number of cases falling within each class interval. The graph has been plotted in two ways, both forms being in common use. In the histogram, the height of the column erected over each class interval corresponds to the number of persons scoring in that interval. We can think of each individual as standing on another's shoulders to form the column. In the frequency polygon, the number of persons in each interval is indicated by a point placed in the center of the class interval and across from the appropriate frequency. The successive points are then joined by straight lines.

Except for minor irregularities, the distribution portrayed in Figure 1 resembles the bell-shaped normal curve. A mathematically determined, perfect normal curve is reproduced in Figure 3. This type of curve has important mathematical properties and provides the basis for many kinds of statistical analyses. For the present purpose, however, only a few features will be noted. Essentially, the curve indicates that the largest number of cases cluster in the center of the range and that the number drops off gradually in both directions as the extremes are approached. The curve is bilaterally symmetrical, with a single peak in the center. Most distributions of human traits, from height and weight to aptitudes and personality characteristics, approximate the normal curve. In general, the larger the group, the more closely will the distribution resemble the theoretical normal curve.

[Figure 1: frequency polygon and histogram plotted from the data of Table 1, with class intervals on the horizontal axis and frequencies on the vertical axis.]

FIG. 1. Distribution Curves: Frequency Polygon and Histogram.

A group of scores can also be described in terms of some measure of central tendency. Such a measure provides a single, most typical or representative score to characterize the performance of the entire group. The most familiar of these measures is the average, more technically known as the mean (M). As is well known, this is found by adding all scores and dividing the sum by the number of cases (N). Another measure of central tendency is the mode, or most frequent score. In a frequency distribution, the mode is the midpoint of the class interval with the highest frequency. Thus, in Table 1, the mode falls midway between 32 and 35, being 33.5. It will be noted that this score corresponds to the highest point on the distribution curve in Figure 1. A third measure of central tendency is the median, or middlemost score when all scores have been arranged in order of size. The median is the point that bisects the distribution, half the cases falling above it and half below.

Further description of a set of test scores is given by measures of variability, or the extent of individual differences around the central tendency. The most obvious and familiar way of reporting variability is in terms of the range between the highest and lowest score. The range, however, is extremely crude and unstable, for it is determined by only two scores. A single unusually high or low score would thus markedly affect its size. A more precise method of measuring variability is based on the difference between each individual's score and the mean of the group.

At this point, it will be helpful to look at the example in Table 2, in which the various measures under consideration have been computed on a group of 10 scores. Such a small group was chosen in order to simplify the demonstration, although in actual practice we would rarely perform these computations on so few cases. Table 2 also serves to introduce certain standard statistical symbols that should be noted for future reference. Original raw scores are conventionally designated by a capital X, and a small x is used to refer to deviations of each score from the group mean. The Greek letter Σ means "sum of." It will be seen that the first column in Table 2 gives the data for the computation of mean and median. The mean is 40; the median is 40.5, falling midway between 40 and 41: five cases (50 percent) are above the median and five below.

TABLE 2
Illustration of Central Tendency and Variability
(Data from Table 1.)

Score (X)   Deviation (x)   Diff. Squared (x²)
   48           +8               64
   47           +7               49
   43           +3                9
   41           +1                1
   41           +1                1
   40            0                0
   38           -2                4
   36           -4               16
   34           -6               36
   32           -8               64
ΣX = 400      Σ|x| = 40       Σx² = 244

M = ΣX/N = 400/10 = 40
Median = 40.5
AD = Σ|x|/N = 40/10 = 4
Variance = σ² = Σx²/N = 244/10 = 24.40
SD or σ = √(Σx²/N) = √24.40 = 4.9
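The computations assembled in Table 2 can be verified with a few lines of code. The sketch below, in Python, uses the ten scores as reconstructed from the table's marginal totals (ΣX = 400, Σ|x| = 40, Σx² = 244); it is an editorial illustration, not part of the original text.

```python
from statistics import mean, median, multimode
from math import sqrt

# The ten scores of Table 2, reconstructed from its totals.
scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]
N = len(scores)

M = mean(scores)                               # 400 / 10 = 40
Mdn = median(scores)                           # midway between 40 and 41 = 40.5
mode = multimode(scores)                       # [41], obtained by two persons
deviations = [x - M for x in scores]           # these sum to zero
AD = sum(abs(d) for d in deviations) / N       # 40 / 10 = 4.0
variance = sum(d * d for d in deviations) / N  # 244 / 10 = 24.40
SD = sqrt(variance)                            # 4.94, shown as 4.9 in Table 2

print(M, Mdn, mode, AD, variance, round(SD, 1))
```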
There is little point in computing a mode in such a small group, since the cases do not show clear-cut clustering on any one score. Technically, however, 41 would represent the mode, because two persons obtained this score, while all other scores occur only once.

The second column of Table 2 shows how far each score deviates above or below the mean of 40. The sum of these deviations will always equal zero, because the positive and negative deviations around the mean necessarily balance, or cancel each other out (+20 - 20 = 0). If we ignore signs, of course, we can average the absolute deviations, thus obtaining a measure known as the average deviation (AD). The symbol |x| in the AD formula indicates that absolute values were summed, without regard to sign. Although of some descriptive value, the AD is not suitable for use in further mathematical analyses because of the arbitrary discarding of signs.

A much more serviceable measure of variability is the standard deviation (symbolized by either SD or σ), in which the negative signs are legitimately eliminated by squaring each deviation. This procedure has been followed in the last column of Table 2. The sum of this column divided by the number of cases (Σx²/N) is known as the variance, or mean square deviation, and is symbolized by σ². The variance has proved extremely useful in sorting out the contributions of different factors to individual differences in test performance. For the present purposes, however, our chief concern is with the SD, which is the square root of the variance, as shown in Table 2. This measure is commonly employed in comparing the variability of different groups. In Figure 2, for example, are two distributions having the same mean but differing in variability. The distribution with wider individual differences yields a larger SD than the one with narrower individual differences.

[Figure 2: two overlapping frequency distributions with the same mean, one with a large SD and one with a small SD.]

FIG. 2. Frequency Distributions with the Same Mean but Different Variability.

The SD also provides the basis for expressing an individual's scores on different tests in terms of norms, as will be shown in the section on standard scores. The interpretation of the SD is especially clear-cut when applied to a normal or approximately normal distribution curve. In such a distribution, there is an exact relationship between the SD and the proportion of cases, as shown in Figure 3. On the baseline of this normal curve have been marked distances representing one, two, and three standard deviations above and below the mean. For instance, in the example given in Table 2, the mean would correspond to a score of 40, +1σ to 44.9 (40 + 4.9), +2σ to 49.8 (40 + 2 × 4.9), and so on. The percentage of cases that fall between the mean and +1σ in a normal curve is 34.13. Because the curve is symmetrical, 34.13 percent of the cases are likewise found between the mean and -1σ, so that between +1σ and -1σ on both sides of the mean there are 68.26 percent of the cases. Nearly all the cases (99.72 percent) fall within ±3σ from the mean. These relationships are particularly relevant in the interpretation of standard scores and percentiles, to be discussed in later sections.

[Figure 3: a normal curve marked off in SD units, showing 68.26 percent of cases within ±1σ, 95.44 percent within ±2σ, and 99.72 percent within ±3σ of the mean.]

FIG. 3. Percentage Distribution of Cases in a Normal Curve.
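The normal-curve percentages just cited can be reproduced from the cumulative normal distribution. The following sketch uses the NormalDist class of the Python standard library; the printed figures differ from those in the text only in the last rounded digit (68.27 and 99.73 rather than 68.26 and 99.72, which were obtained by doubling the rounded one-tailed values).

```python
from statistics import NormalDist

nd = NormalDist()   # unit normal curve: mean 0, SD 1

# Percentage of cases between the mean and +1 SD:
print(round((nd.cdf(1) - nd.cdf(0)) * 100, 2))      # 34.13

# Percentage within +/-1, +/-2, and +/-3 SD of the mean:
for k in (1, 2, 3):
    pct = (nd.cdf(k) - nd.cdf(-k)) * 100
    print(f"within +/-{k} SD: {pct:.2f}%")           # 68.27, 95.45, 99.73

# Applied to Table 2 (mean 40, SD 4.9): a score of 44.9 lies at +1 SD.
print(round(NormalDist(40, 4.9).cdf(44.9), 4))       # 0.8413
```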
DEVELOPMENTAL NORMS

One way in which meaning can be attached to test scores is to indicate how far along the normal developmental path the individual has progressed. Thus an 8-year-old who performs as well as the average 10-year-old on an intelligence test may be described as having a mental age of 10; a mentally retarded adult who performs at the same level would likewise be assigned an MA of 10. In a different context, a fourth-grade child may be characterized as reaching the sixth-grade norm in a reading test and the third-grade norm in an arithmetic test. Other developmental systems utilize more highly qualitative descriptions of behavior in specific functions ranging from sensorimotor activities to concept formation. However expressed, scores based on developmental norms tend to be psychometrically crude and do not lend themselves well to precise statistical treatment. Nevertheless, they have considerable appeal for descriptive purposes, especially in the intensive clinical study of individuals and for certain research purposes.
MENTAL AGE. In Chapter 1 it was noted that the term "mental age" was widely popularized through the various translations and adaptations of the Binet-Simon scales, although Binet himself had employed the more neutral term "mental level." In age scales such as the Binet and its revisions, items are grouped into year levels. For example, those items passed by the majority of 7-year-olds in the standardization sample are placed in the 7-year level, those passed by the majority of 8-year-olds are assigned to the 8-year level, and so forth. A child's score on the test will then correspond to the highest year level that he can successfully complete. In actual practice, the individual's performance shows a certain amount of scatter. In other words, the subject fails some tests below his mental age level and passes some above it. For this reason, it is customary to compute the basal age, i.e., the highest age at and below which all tests are passed. Partial credits, in months, are then added to this basal age for all tests passed at higher year levels. The child's mental age on the test is thus the sum of the basal age and the additional months of credit earned at higher age levels.
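The computation of a mental age from a basal age and partial credits amounts to simple addition of months. The sketch below illustrates it with invented figures (a basal age of 6 years and ten months of credit at higher year levels); it is an editorial example, not drawn from any actual test record.

```python
# Hypothetical record on an age scale: basal age of 6 years (all tests
# at and below the 6-year level passed), plus months of credit for
# tests passed at higher year levels.
basal_age_months = 6 * 12
credits = [2, 2, 4, 2]        # invented credits, in months

ma_months = basal_age_months + sum(credits)   # 72 + 10 = 82
print(f"MA = {ma_months // 12} years, {ma_months % 12} months")  # 6 years, 10 months
```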
Mental age norms have also been employed with tests that are not divided into year levels. In such a case, the subject's raw score is first determined. Such a score may be the total number of correct items on the whole test; or it may be based on time, on number of errors, or on some combination of such measures. The mean raw scores obtained by the children in each year group within the standardization sample constitute the age norms for such a test. The mean raw score of the 8-year-old children, for example, would represent the 8-year norm. If an individual's raw score is equal to the mean 8-year-old raw score, then his mental age on the test is 8 years. All raw scores on such a test can be transformed in a similar manner by reference to the age norms.

It should be noted that the mental age unit does not remain constant with age, but tends to shrink with advancing years. For example, a child who is one year retarded at age 4 will be approximately three years retarded at age 12. One year of mental growth from ages 3 to 4 is equivalent to three years of growth from ages 9 to 12. Since intellectual development progresses more rapidly at the earlier ages and gradually decreases as the individual approaches his mature limit, the mental age unit shrinks correspondingly with age. This relationship may be more readily visualized if we think of the individual's height as being expressed in terms of height age. The difference in inches between a height age of 3 and 4 years would be greater than that between a height age of 10 and 11. Owing to the progressive shrinkage of the MA unit, one year of acceleration or retardation at, let us say, age 5 represents a larger deviation from the norm than does one year of acceleration or retardation at age 10.
GRADE EQUIVALENTS. Scores on educational achievement tests are often interpreted in terms of grade equivalents. This practice is understandable because the tests are employed within an academic setting. To describe a pupil's achievement as equivalent to seventh-grade performance in spelling, eighth-grade in reading, and fifth-grade in arithmetic has the same popular appeal as the use of mental age in the traditional intelligence tests.

Grade norms are found by computing the mean raw score obtained by children in each grade. Thus, if the average number of problems solved correctly on an arithmetic test by the fourth graders in the standardization sample is 23, then a raw score of 23 corresponds to a grade equivalent of 4. Intermediate grade equivalents, representing fractions of a grade, are usually found by interpolation, although they can also be obtained directly by testing children at different times within the school year. Because the school year covers ten months, successive months can be expressed as decimals. For example, 4.0 refers to average performance at the beginning of the fourth grade (September testing), 4.5 refers to average performance at the middle of the grade (February testing), and so forth.
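The interpolation between grade means is likewise mechanical. In the sketch below, the fourth-grade mean of 23 comes from the example in the text; the means for the other grades are invented for the illustration, and the function assumes the means rise with grade.

```python
def grade_equivalent(raw, grade_means):
    """Linearly interpolate a grade equivalent from the mean raw
    scores of successive grade groups (grade -> mean raw score)."""
    grades = sorted(grade_means)
    for g1, g2 in zip(grades, grades[1:]):
        m1, m2 = grade_means[g1], grade_means[g2]
        if m1 <= raw <= m2:
            return g1 + (raw - m1) / (m2 - m1) * (g2 - g1)
    return None   # raw score outside the range of the norms

# Fourth-grade mean of 23 as in the text; other means hypothetical.
norms = {3.0: 17, 4.0: 23, 5.0: 28, 6.0: 32}
print(round(grade_equivalent(26, norms), 1))   # 4.6, i.e., grade 4, sixth month
```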
Despite their popularity, grade norms have several shortcomings. First, the content of instruction varies somewhat from grade to grade. Hence, grade norms are appropriate only for common subjects taught throughout the grade levels covered by the test. They are not generally applicable at the high school level, where many subjects may be studied for only one or two years. Even with subjects taught in each grade, however, the emphasis placed on different subjects may vary from grade to grade, and progress may therefore be more rapid in one subject than in another during a particular grade. In other words, grade units are obviously unequal, and these inequalities occur irregularly in different subjects.

Grade norms are also subject to misinterpretation unless the test user keeps firmly in mind the manner in which they were derived. For example, if a fourth-grade child obtains a grade equivalent of 6.9 in arithmetic, it does not mean that he has mastered the arithmetic processes taught in the sixth grade. He undoubtedly obtained his score largely by superior performance in fourth-grade arithmetic. It certainly could not be assumed that he has the prerequisites for seventh-grade arithmetic.

Grade norms tend, moreover, to be incorrectly regarded as performance standards. A sixth-grade teacher, for example, may assume that all pupils in her class should fall at or close to the sixth-grade norm in achievement tests. This misconception is certainly not surprising when grade norms are used. Yet individual differences within any one grade are such that the range of achievement test scores will inevitably extend over several grades.

ORDINAL SCALES. Another approach to developmental norms derives from research in child psychology. Empirical observation of behavior development in infants and young children led to the description of behavior typical of successive ages in such functions as locomotion, sensory discrimination, linguistic communication, and concept formation. An example is provided by the work of Gesell and his associates at Yale (Ames, 1937; Gesell et al., 1940; Gesell & Amatruda, 1947; Halverson, 1933). The Gesell Developmental Schedules show the approximate developmental level in months that the child has attained in each of four major areas of behavior, namely, motor, adaptive, language, and personal-social. These levels are found by comparing the child's behavior with that typical of eight key ages, ranging from 4 weeks to 36 months.

Gesell and his co-workers emphasized the sequential patterning of early behavior development. They cited extensive evidence of uniformities of developmental sequences and an orderly progression of behavior changes. For example, the child's reactions toward a small object placed in front of him exhibit a characteristic chronological sequence in visual fixation and in hand and finger movements. Crude attempts at palmar prehension occur at an earlier age than use of the thumb in opposition to the palm; this type of prehension is in turn followed by use of the thumb and index finger in a more efficient pincerlike grasp of the object. Such sequential patterning was likewise observed in walking, stair climbing, and most of the sensorimotor development of the first few years. The scales developed by these investigators are ordinal in the sense that developmental stages follow in a constant order, each stage presupposing mastery of prerequisite behavior characteristic of earlier stages.1

Since the 1960s, there has been a sharp upsurge of interest in the developmental theories of the Swiss child psychologist Jean Piaget (see Flavell, 1963; Ginsburg & Opper, 1969; Green, Ford, & Flamer, 1971). Piaget's research has focused on the development of cognitive processes from infancy to the midteens. He is concerned with specific concepts rather than broad abilities. An example of such a concept, or schema, is object permanence, whereby the child is aware of the identity and continuing existence of objects when they are seen from different angles or are out of sight. Another widely studied concept is conservation, or the recognition that an attribute remains constant over changes in perceptual appearance, as when the same quantity of liquid is poured into differently shaped containers, or when rods of the same length are placed in different spatial arrangements.

Piagetian tasks have been used widely in research by developmental psychologists, and some have been organized into standardized scales, to be discussed in Chapters 10 and 14 (Goldschmid & Bentler, 1968b; Loretan, 1966; Pinard & Laurendeau, 1964; Uzgiris & Hunt, 1975). In accordance with Piaget's approach, these instruments are ordinal scales, in which the attainment of one stage is contingent upon completion of the earlier stages in the development of the concept. The tasks are designed to reveal the dominant aspects of each developmental stage; only later are empirical data gathered regarding the ages at which each stage is typically reached. In this respect, the procedure differs from that followed in constructing age scales, in which items are selected in the first place on the basis of their differentiating between successive ages.

In summary, ordinal scales are designed to identify the stage reached by the child in the development of specific behavior functions. Although scores may be reported in terms of approximate age levels, such scores are secondary to a qualitative description of the child's characteristic behavior. The ordinality of such scales refers to the uniform progression of development through successive stages. Insofar as these scales typically provide information about what the child is actually able to do (e.g., climbs stairs without assistance; recognizes identity in quantity of liquid when poured into differently shaped containers), they share important features with the criterion-referenced tests to be discussed in a later section of this chapter.

1 This usage of the term "ordinal scale" differs from that in statistics, in which an ordinal scale is simply one that permits a rank-ordering of individuals without knowledge about the amount of difference between them; in the statistical sense, ordinal scales are contrasted to equal-unit interval scales. Ordinal scales of child development are usually designed on the model of a Guttman scale, or simplex, in which successful performance at one level implies success at lower levels (Guttman, 1944). An extension of Guttman's analysis to include nonlinear hierarchies is described by Bart and Airasian (1974), with special reference to Piagetian scales.
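The Guttman, or simplex, property described in footnote 1 can be stated compactly in code: in a perfect simplex, a pass on a harder item never follows a failure on an easier one. The following sketch is an editorial illustration with invented response patterns.

```python
def is_guttman_pattern(items):
    """True if a response pattern (items ordered from easiest to
    hardest, 1 = pass, 0 = fail) fits a perfect Guttman simplex:
    all passes precede all failures."""
    return sorted(items, reverse=True) == list(items)

print(is_guttman_pattern([1, 1, 1, 0, 0]))   # True: an ordinal pattern
print(is_guttman_pattern([1, 0, 1, 0, 0]))   # False: passes a harder item after failing an easier one
```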

WITHIN-GROUP NORMS

Nearly all standardized tests now provide some form of within-group norms. With such norms, the individual's performance is evaluated in terms of the performance of the most nearly comparable standardization group, as when comparing a child's raw score with that of children of the same chronological age or in the same school grade. Within-group scores have a uniform and clearly defined quantitative meaning and can be appropriately employed in most types of statistical analysis.

PERCENTILES. Percentile scores are expressed in terms of the percentage of persons in the standardization sample who fall below a given raw score. For example, if 28 percent of the persons obtain fewer than 15 problems correct on an arithmetic reasoning test, then a raw score of 15 corresponds to the 28th percentile (P28). A percentile indicates the individual's relative position in the standardization sample. Percentiles can also be regarded as ranks in a group of 100, except that in ranking it is customary to start counting at the top, the best person in the group receiving a rank of one. With percentiles, on the other hand, we begin counting at the bottom, so that the lower the percentile, the poorer the individual's standing.
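Computing a percentile rank from a distribution of raw scores follows directly from the definition just given. The sketch below reproduces the 28th-percentile example of the text with an invented standardization sample of 100 cases.

```python
def percentile_rank(raw, sample):
    """Percentage of the standardization sample scoring below
    the given raw score (the definition used in the text)."""
    below = sum(1 for s in sample if s < raw)
    return 100 * below / len(sample)

# If 28 percent of the sample solve fewer than 15 problems,
# a raw score of 15 corresponds to the 28th percentile:
sample = [12] * 28 + [16] * 72        # hypothetical sample of 100 cases
print(percentile_rank(15, sample))    # 28.0
```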
The 50th percentile (P50) corresponds to the median, already discussed as a measure of central tendency. Percentiles above 50 represent above-average performance; those below 50 signify inferior performance. The 25th and 75th percentiles are known as the first and third quartile points (Q1 and Q3), because they cut off the lowest and highest quarters of the distribution. Like the median, they provide convenient landmarks for describing a distribution of scores and comparing it with other distributions.

Percentiles should not be confused with the familiar percentage scores. The latter are raw scores, expressed in terms of the percentage of correct items; percentiles are derived scores, expressed in terms of percentage of persons. A raw score lower than any obtained in the standardization sample would have a percentile rank of zero (P0); one higher than any score in the standardization sample would have a percentile rank of 100 (P100). These percentiles, however, do not imply a zero raw score and a perfect raw score.

Percentile scores have several advantages. They are easy to compute and can be readily understood, even by relatively untrained persons. Moreover, percentiles are universally applicable. They can be used equally well with adults and children and are suitable for any type of test, whether it measures aptitude or personality variables.

The chief drawback of percentile scores arises from the marked inequality of their units, especially at the extremes of the distribution. If the distribution of raw scores approximates the normal curve, as is true of most test scores, then raw score differences near the median or center of the distribution are exaggerated in the percentile transformation, whereas raw score differences near the ends of the distribution are greatly shrunk. This distortion of distances between scores can be seen in Figure 4. In a normal curve, it will be recalled, cases cluster closely at the center and scatter more widely as the extremes are approached. Consequently, any given percentage of cases near the center covers a shorter distance on the baseline than the same percentage near the ends of the distribution. In Figure 4, this discrepancy in the gaps between percentile ranks (PR) can readily be seen if we compare the distance between a PR of 40 and a PR of 50 with that between a PR of 10 and a PR of 20. Even more striking is the discrepancy between these distances and that between a PR of 10 and a PR of 1. (In a mathematically derived normal curve, zero percentile is not reached until infinity and hence cannot be shown on the graph.)

[Figure 4: a normal curve with the quartile points Q1, Mdn, and Q3 and the percentile ranks marked along the baseline; the percentile ranks corresponding to -3σ through +3σ are given beneath the graph.]

FIG. 4. Percentile Ranks in a Normal Distribution.

The same relationship can be seen from the opposite direction if we examine the percentile ranks corresponding to equal σ-distances from the mean of a normal curve. These percentile ranks are given under the graph in Figure 4. Thus, the percentile difference between the mean and +1σ is 34 (84 - 50). That between +1σ and +2σ is only 14 (98 - 84).

It is apparent that percentiles show each individual's relative position in the normative sample but not the amount of difference between scores. If plotted on arithmetic probability paper, however, percentile scores can also provide a correct visual picture of the differences between scores. Arithmetic probability paper is a cross-section paper in which the vertical lines are spaced in the same way as the percentile points in a normal distribution (as in Figure 4), whereas the horizontal lines are uniformly spaced, or vice versa (as in Figure 5).
Such normal percentile charts can be used to plot the scores of different persons on the same test or the scores of the same person on different tests. In either case, the real interscore differences will be correctly represented. Many aptitude and achievement batteries now utilize this technique in their score profiles, which show the individual's performance in each test. An example is the Individual Report Form of the Differential Aptitude Tests, reproduced in Figure 13 (Ch. 5).

[Figure 5: a normal percentile chart on which the scores of eight persons, John, Mary, Ellen, Edgar, Jane, Dick, Bill, and Debby, are plotted.]

FIG. 5. A Normal Percentile Chart. Percentiles are spaced so as to correspond to equal distances in a normal distribution. Compare the score distance between John and Mary with that between Ellen and Edgar; within both pairs, the percentile difference is 5 points. Jane and Dick differ by 10 percentile points, as do Bill and Debby.

STANDARD SCORES. Current tests are making increasing use of standard scores, which are the most satisfactory type of derived score from most points of view. Standard scores express the individual's distance from the mean in terms of the standard deviation of the distribution.

Standard scores may be obtained by either linear or nonlinear transformations of the original raw scores. When found by a linear transformation, they retain the exact numerical relations of the original raw scores, because they are computed by subtracting a constant from each raw score and then dividing the result by another constant. The relative magnitude of differences between standard scores derived by such a linear transformation corresponds exactly to that between the raw scores. All properties of the original distribution of raw scores are duplicated in the distribution of these standard scores. For this reason, any computations that can be carried out with the original raw scores can also be carried out with linear standard scores, without any distortion of results.

Linearly derived standard scores are often designated simply as "standard scores" or "z scores." To compute a z score, we find the difference between the individual's raw score and the mean of the normative group and then divide this difference by the SD of the normative group. Table 3 shows the computation of z scores for two individuals, one of whom falls 1 SD above the group mean, the other .40 SD below the mean. Any raw score that is exactly equal to the mean is equivalent to a z score of zero. It is apparent that such a procedure will yield derived scores that have a negative sign for all subjects falling below the mean. Moreover, because the total range of most groups extends no farther than about 3 SD's above and below the mean, such standard scores will have to be reported to at least one decimal place in order to provide sufficient differentiation among individuals.

TABLE 3
Computation of Standard Scores

z = (X - M) / SD

John's score:   X1 = 65    z1 = (65 - 60) / 5 = +1.00
Bill's score:   X2 = 58    z2 = (58 - 60) / 5 = -0.40
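The arithmetic of Table 3, and the further linear transformation discussed in the following paragraphs, can be condensed into two one-line functions. The values below are those of Table 3 (mean 60, SD 5) and of the CEEB scale (mean 500, SD 100) described in the text.

```python
def z_score(x, mean, sd):
    """Standard score: distance from the mean in SD units."""
    return (x - mean) / sd

# Table 3: normative group with mean 60 and SD 5.
print(z_score(65, 60, 5))        # John:  +1.0
print(z_score(58, 60, 5))        # Bill:  -0.4

def rescale(z, new_mean, new_sd):
    """Linear transformation to a more convenient scale, such as
    CEEB scores with mean 500 and SD 100 (see the next paragraph)."""
    return new_mean + z * new_sd

print(rescale(-1.0, 500, 100))   # 400
print(rescale(+1.5, 500, 100))   # 650
```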

Both the above conditions, viz., the occurrence of negative values and of decimals, tend to produce awkward numbers that are confusing and difficult to use for both computational and reporting purposes. For this reason, some further linear transformation is usually applied, simply to put the scores into a more convenient form. For example, the scores on the Scholastic Aptitude Test (SAT) of the College Entrance Examination Board are standard scores adjusted to a mean of 500 and an SD of 100. Thus a standard score of -1 on this test would be expressed as 400 (500 - 100 = 400). Similarly, a standard score of +1.5 would correspond to 650 (500 + 1.5 × 100 = 650). To convert an original standard score to the new scale, it is simply necessary to multiply the standard score by the desired SD (100) and add it to or subtract it from the desired mean (500). Any other convenient values can be arbitrarily chosen for the new mean and SD. Scores on the separate subtests of the Wechsler Intelligence Scales, for instance, are converted to a distribution with a mean of 10 and an SD of 3. All such measures are examples of linearly transformed standard scores.

It will be recalled that one of the reasons for transforming raw scores into any derived scale is to render scores on different tests comparable. The linearly derived standard scores discussed in the preceding section will be comparable only when found from distributions that have approximately the same form. Under such conditions, a score corresponding to 1 SD above the mean, for example, signifies that the individual occupies the same position in relation to both groups. His score exceeds approximately the same percentage of persons in both distributions, and this percentage can be determined if the form of the distribution is known. If, however, one distribution is markedly skewed and the other normal, a z score of +1.00 might exceed only 50 percent of the cases in one group but would exceed 84 percent in the other.

In order to achieve comparability of scores from dissimilarly shaped distributions, nonlinear transformations may be employed to fit the scores to any specified type of distribution curve. The mental age and percentile scores described in earlier sections represent nonlinear transformations, but they are subject to other limitations already discussed. Although under certain circumstances another type of distribution may be more appropriate, the normal curve is usually employed for this purpose. One of the chief reasons for this choice is that most raw score distributions approximate the normal curve more closely than they do any other type of curve. Moreover, physical measures such as height and weight, which use equal-unit scales derived through physical operations, generally yield normal distributions. Another important advantage of the normal curve is that it has many useful mathematical properties, which facilitate further computations.2

Normalized standard scores are standard scores expressed in terms of a distribution that has been transformed to fit a normal curve. Such scores can be computed by reference to tables giving the percentage of cases falling at different SD distances from the mean of a normal curve. First, the percentage of persons in the standardization sample falling at or above each raw score is found. This percentage is then located in the normal curve frequency table, and the corresponding normalized standard score is obtained. Normalized standard scores are expressed in the same form as linearly derived standard scores, viz., with a mean of zero and an SD of 1. Thus, a normalized score of zero indicates that the individual falls at the mean of a normal curve, excelling 50 percent of the group. A score of -1 means that he surpasses approximately 16 percent of the group; and a score of +1, that he surpasses 84 percent. These percentages correspond to a distance of 1 SD below and 1 SD above the mean of a normal curve, respectively, as can be seen by reference to the bottom line of Figure 4.

Like linearly derived standard scores, normalized standard scores can be put into any convenient form. If the normalized standard score is multiplied by 10 and added to or subtracted from 50, it is converted into a T score, a type of score first proposed by McCall (1922). On this scale, a score of 50 corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth. Another well-known transformation is represented by the stanine scale, developed by the United States Air Force during World War II. This scale provides a single-digit system of scores with a mean of 5 and an SD of approximately 2.3 The name stanine (a contraction of "standard nine") is based on the fact that the scores run from 1 to 9. The restriction of scores to single-digit numbers has certain computational advantages, for each score requires only a single column on computer punched cards.

TABLE 4
Normal Curve Percentages for Use in Stanine Conversion

Percentage:   4    7   12   17   20   17   12    7    4
Stanine:      1    2    3    4    5    6    7    8    9

Raw scores can readily be converted to stanines by arranging the original scores in order of size and then assigning stanines in accordance with the normal curve percentages reproduced in Table 4. For example, if the group consists of exactly 100 persons, the 4 lowest-scoring persons receive a stanine score of 1, the next 7 a score of 2, the next 12 a score of 3, and so on. When the group contains more or fewer than 100 cases, the number corresponding to each designated percentage is first computed, and these numbers of cases are then given the appropriate stanines.

2 Partly for this reason and partly as a result of other theoretical considerations, it has frequently been argued that, by normalizing raw scores, an equal-unit scale could be developed for psychological measurement similar to the equal-unit scales of physical measurement. This, however, is a debatable point that involves certain questionable assumptions.

3 Kaiser (1958) proposed a modification of the stanine scale that involves slight changes in the percentages and yields an SD of exactly 2, thus being easier to handle quantitatively. Other variants are the C scale (Guilford & Fruchter, 1973, Ch. 19), consisting of 11 units and also yielding an SD of 2, and the 10-unit sten scale, with 5 units above and 5 below the mean (Canfield, 1951).
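Normalized standard scores, T scores, and stanines can all be derived from the percentage of cases falling below a given raw score. The sketch below uses the inverse normal function of the Python standard library and the stanine percentages of Table 4; it illustrates the procedures described above and is not a reproduction of any published conversion table.

```python
from statistics import NormalDist

nd = NormalDist()

def normalized_z(percentage_below):
    """Normalized standard score found from the percentage of the
    standardization sample falling below a given raw score."""
    return nd.inv_cdf(percentage_below / 100)

def t_score(z):
    return 50 + 10 * z          # McCall's T scale

# Cumulative stanine boundaries follow from Table 4's percentages.
STANINE_PCTS = [4, 7, 12, 17, 20, 17, 12, 7, 4]

def stanine(percentage_below):
    cum = 0
    for i, pct in enumerate(STANINE_PCTS, start=1):
        cum += pct
        if percentage_below < cum:
            return i
    return 9

z = normalized_z(84)                    # about +1.0
print(round(z, 2), round(t_score(z)))   # 0.99, 60
print(stanine(84))                      # 7 (stanine 7 spans percentiles 77-89)
```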
Thus, out of 200 cases, 8 would be assigned a stanine of 1 (4 percent of 200 = 8). With 150 cases, 6 would receive a stanine of 1 (4 percent of 150 = 6). For any group containing from 10 to 100 cases, Bartlett and Edgerton (1966) have prepared a table whereby ranks can be directly converted to stanines. Because of their practical as well as theoretical advantages, stanines are being used increasingly, especially with aptitude and achievement tests.

Although normalized standard scores are the most satisfactory type of score for the majority of purposes, there are nevertheless certain technical objections to normalizing all distributions routinely. Such a transformation should be carried out only when the sample is large and representative and when there is reason to believe that the deviation from normality results from defects in the test rather than from characteristics of the sample or from other factors affecting the behavior under consideration. It should also be noted that when the original distribution of raw scores approximates normality, the linearly derived standard scores and the normalized standard scores will be very similar. Although the methods of deriving these two types of scores are quite different, the resulting scores will be nearly identical under such conditions. Obviously, the process of normalizing a distribution that is already virtually normal will produce little or no change. Whenever feasible, it is generally more desirable to obtain a normal distribution of raw scores by proper adjustment of the difficulty level of test items rather than by subsequently normalizing a markedly nonnormal distribution. With an approximately normal distribution of raw scores, the linearly derived standard scores will serve the same purposes as normalized standard scores.

THE DEVIATION IQ. In an effort to convert MA scores into a uniform index of the individual's relative status, the ratio IQ (Intelligence Quotient) was introduced in early intelligence tests. Such an IQ was simply the ratio of mental age to chronological age, multiplied by 100 to eliminate decimals (IQ = 100 × MA/CA). Obviously, if a child's MA equals his CA, his IQ will be exactly 100. An IQ of 100 thus represents normal or average performance. IQ's below 100 indicate retardation; those above 100, acceleration.

The apparent logical simplicity of the traditional ratio IQ, however, proved deceptive. A major technical difficulty is that, unless the SD of the IQ distribution remains approximately constant with age, IQ's will not be comparable at different age levels. An IQ of 115 at age 10, for example, may indicate the same degree of superiority as an IQ of 125 at age 12, since both may fall at a distance of 1 SD from the means of their respective age distributions. In actual practice, it proved very difficult to construct tests that met the psychometric requirements for comparability of ratio IQ's throughout their age range. Chiefly for this reason, the ratio IQ has been largely replaced by the so-called deviation IQ, which is actually another variant of the familiar standard score. The deviation IQ is a standard score with a mean of 100 and an SD that approximates the SD of the Stanford-Binet IQ distribution. Although the SD of the Stanford-Binet ratio IQ (last used in the 1937 edition) was not exactly constant at all ages, it fluctuated around a median value slightly greater than 16. Hence, if an SD close to 16 is chosen in reporting standard scores on a newly developed test, the resulting scores can be interpreted in the same way as Stanford-Binet ratio IQ's. Since Stanford-Binet IQ's have been in use for many years, testers and clinicians have become accustomed to interpreting and classifying test performance in terms of such IQ levels. They have learned what to expect from individuals with IQ's of 40, 70, 90, 130, and so forth. There are therefore certain practical advantages in the use of a derived scale that corresponds to the familiar distribution of Stanford-Binet IQ's. Such a correspondence of score units can be achieved by the selection of numerical values for the mean and SD that agree closely with those in the Stanford-Binet distribution.

It should be added that the use of the term "IQ" to designate such standard scores may seem to be somewhat misleading. Such IQ's are not derived by the same methods employed in finding traditional ratio IQ's. They are not ratios of mental ages and chronological ages. The justification lies in the general familiarity of the term "IQ," and in the fact that such scores can be interpreted as IQ's provided that their SD is approximately equal to that of previously known IQ's. Among the first tests to express scores in terms of deviation IQ's were the Wechsler Intelligence Scales. In these tests, the mean is 100 and the SD 15. Deviation IQ's are also used in a number of current group tests of intelligence and in the latest revision of the Stanford-Binet itself.

With the increasing use of deviation IQ's, it is important to remember that deviation IQ's from different tests are comparable only when they employ the same or closely similar values for the SD. This value should always be reported in the manual and carefully noted by the test user. If a test maker chooses a different value for the SD in making up his deviation IQ scale, the meaning of any given IQ on his test will be quite different from its meaning on other tests. These discrepancies are illustrated in Table 5, which shows the percentage of cases in normal distributions with SD's from 12 to 18 who would obtain IQ's at different levels. These SD values have actually been employed in the IQ scales of published tests. Table 5 shows, for example, that an IQ of 70 cuts off the lowest 3.1 percent when the SD is 16 (as in the Stanford-Binet); but it may cut off as few as 0.7 percent (SD = 12) or as many as 5.1 percent (SD = 18). An IQ of 70 has been used traditionally as a cutoff point for identifying mental retardation.
The same discrepancies, of course, apply to IQ's of 130 and above, which might be used in selecting children for special programs for the intellectually gifted. The IQ range between 90 and 110, generally described as normal, may include as few as 42 percent or as many as 59.6 percent of the population, depending on the test chosen. To be sure, test publishers are making efforts to adopt the uniform SD of 16 in new tests and in new editions of earlier tests. There are still enough variations among currently available tests, however, to make the checking of the SD imperative.

TABLE 5
Percentage of Cases at Each IQ Interval in Normal Distributions with Mean of 100 and Different Standard Deviations
(Courtesy Test Department, Harcourt Brace Jovanovich, Inc.)

IQ Interval    SD = 12   SD = 14   SD = 16   SD = 18
130 & above        0.7       1.6       3.1       5.1
120-129            4.3       6.3       7.5       8.5
110-119           15.2      16.0      15.8      15.4
100-109           29.8      26.1      23.6      21.0
90-99             29.8      26.1      23.6      21.0
80-89             15.2      16.0      15.8      15.4
70-79              4.3       6.3       7.5       8.5
Below 70           0.7       1.6       3.1       5.1
Total            100.0     100.0     100.0     100.0

FIG. 6. Relationships among Different Types of Test Scores in a Normal Distribution. (The figure aligns, under the normal curve, the scales discussed in this chapter: z scores from -4 to +4; T scores from 10 to 90; CEEB scores from 200 to 800; deviation IQ's with SD = 15 from 55 to 145; stanines 1 through 9, with the percentages 4, 7, 12, 17, 20, 17, 12, 7, and 4; and percentiles from 1 to 99.)
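The entries in Table 5 are simply normal-curve areas and can be recomputed directly. The following short Python sketch (illustrative only; the helper name normal_cdf is not part of any published table's preparation) reproduces the tabled percentages from the standard library alone, agreeing with the published values to within a few tenths of a percent (the table's own rounding conventions are not stated):

    import math

    def normal_cdf(x, mean=100.0, sd=16.0):
        # Proportion of a normal distribution falling below the score x.
        return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

    # IQ intervals as laid out in Table 5; None marks an open-ended tail.
    intervals = [(130, None), (120, 130), (110, 120), (100, 110),
                 (90, 100), (80, 90), (70, 80), (None, 70)]

    for sd in (12, 14, 16, 18):
        row = []
        for lo, hi in intervals:
            p_lo = normal_cdf(lo, sd=sd) if lo is not None else 0.0
            p_hi = normal_cdf(hi, sd=sd) if hi is not None else 1.0
            row.append(100.0 * (p_hi - p_lo))
        print("SD = %d: " % sd + "  ".join("%5.1f" % p for p in row))

With SD = 12, for example, the area below IQ 70 works out to about 0.6 percent, and with SD = 18 to about 4.8 percent, illustrating how strongly the meaning of a fixed IQ cutoff depends on the SD of the scale.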

INTERRELATIONSHIPS OF WITHIN-GROUP SCORES. At this stage in our discussion of derived scores, the reader may have become aware of a rapprochement among the various types of scores. Percentiles have gradually been taking on at least a graphic resemblance to normalized standard scores. Linear standard scores are indistinguishable from normalized standard scores if the original distribution of raw scores closely approximates the normal curve. Finally, standard scores have become IQ's and vice versa. In connection with the last point, a reexamination of the meaning of a ratio IQ on such a test as the Stanford-Binet will show that these IQ's can themselves be interpreted as standard scores. If we know that the distribution of Stanford-Binet ratio IQ's had a mean of 100 and an SD of approximately 16, we can conclude that an IQ of 116 falls at a distance of 1 SD above the mean and represents a standard score of +1.00. Similarly, an IQ of 132 corresponds to a standard score of +2.00, an IQ of 76 to a standard score of -1.50, and so forth. Moreover, a Stanford-Binet ratio IQ of 116 corresponds to a percentile rank of approximately 84, because in a normal curve 84 percent of the cases fall below +1.00 SD (Figure 4).

In Figure 6 are summarized the relationships that exist in a normal distribution among the types of scores so far discussed in this chapter. These include z scores, College Entrance Examination Board (CEEB) scores, Wechsler deviation IQ's (SD = 15), T scores, stanines, and percentiles. Ratio IQ's on any test will coincide with the given deviation IQ scale if they are normally distributed and have an SD of 15. Any other normally distributed IQ could be added to the chart, provided we know its SD. If the SD is 20, for instance, then an IQ of 120 corresponds to +1 SD, an IQ of 80 to -1 SD, and so on.

In conclusion, the exact form in which scores are reported is dictated largely by convenience, familiarity, and ease of developing norms. Standard scores in any form (including the deviation IQ) have generally replaced other types of scores because of certain advantages they offer with regard to test construction and statistical treatment of data. Most types of within-group derived scores, however, are fundamentally similar if carefully derived and properly interpreted. When certain statistical conditions are met, each of these scores can be readily translated into any of the others.
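These translations can be made explicit in a few lines. The sketch below (Python; the function name is illustrative, and a normal distribution is assumed throughout) converts a z score into the derived scales aligned in Figure 6:

    import math

    def z_to_derived_scores(z):
        # The linear scales differ only in their chosen mean and SD; the
        # percentile requires the normal-curve area, and the stanine is z
        # rescaled to mean 5, SD 2, rounded, and held between 1 and 9.
        return {
            "T score": 50.0 + 10.0 * z,
            "CEEB score": 500.0 + 100.0 * z,
            "Deviation IQ (SD = 15)": 100.0 + 15.0 * z,
            "stanine": min(9, max(1, round(2.0 * z + 5.0))),
            "percentile": 50.0 * (1.0 + math.erf(z / math.sqrt(2.0))),
        }

    # A Stanford-Binet ratio IQ of 116 (mean 100, SD about 16) is z = +1.00:
    for scale, value in z_to_derived_scores((116 - 100) / 16).items():
        print(scale, round(value, 1))

Running this example yields a T score of 60, a CEEB score of 600, a deviation IQ of 115, a stanine of 7, and a percentile of about 84, matching the correspondences read from Figure 6.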
INTERTEST COMPARISONS. An IQ, or any other score, should always be accompanied by the name of the test on which it was obtained. Test scores cannot be properly interpreted in the abstract; they must be referred to particular tests. If the school records show that Bill Jones received an IQ of 94 and Tom Brown an IQ of 110, such IQ's cannot be accepted at face value without further information. The positions of these two students might have been reversed by exchanging the particular tests that each was given in his respective school.

Similarly, an individual's relative standing in different functions may be grossly misrepresented through lack of comparability of test norms. Let us suppose that a student has been given a verbal comprehension test and a spatial aptitude test to determine his relative standing in the two fields. If the verbal ability test was standardized on a random sample of high school students, while the spatial test was standardized on a selected group of boys attending elective shop courses, the examiner might erroneously conclude that the individual is much more able along verbal than along spatial lines, when the reverse may actually be the case.

Still another example involves longitudinal comparisons of a single individual's test performance over time. If a schoolchild's cumulative record shows IQ's of 118, 115, and 101 at the fourth, fifth, and sixth grades, the first question to ask before interpreting these changes is, "What tests did he take on these three occasions?" The apparent decline may reflect no more than the differences among the tests. In that case, he would have obtained these scores even if the three tests had been administered within a week of each other.

There are three principal reasons to account for systematic variations among the scores obtained by the same individual on different tests. First, tests may differ in content despite their similar labels. So-called intelligence tests provide many illustrations of this confusion. Although commonly described by the same blanket term, one of these tests may include only verbal content, another may tap predominantly spatial aptitudes, and still another may cover verbal, numerical, and spatial content in about equal proportions. Second, the scale units may not be comparable. As explained earlier in this chapter, if IQ's on one test have an SD of 12 and IQ's on another have an SD of 18, then an individual who received an IQ of 112 on the first test is most likely to receive an IQ of 118 on the second. Third, the composition of the standardization samples used in establishing norms for different tests may vary. Obviously, the same individual will appear to have performed better when compared with an inferior group than when compared with a superior group.

Lack of comparability of either test content or scale units can usually be detected by reference to the test itself or to the test manual. Differences in the respective normative samples, however, are more likely to be overlooked. Such differences probably account for many otherwise unexplained discrepancies in test results.

THE NORMATIVE SAMPLE. Any norm, however expressed, is restricted to the particular normative population from which it was derived. The test user should never lose sight of the way in which norms are established. Psychological test norms are in no sense absolute, universal, or permanent. They merely represent the test performance of the subjects constituting the standardization sample. In choosing such a sample, an effort is usually made to obtain a representative cross section of the population for which the test is designed.

In statistical terminology, a distinction is made between sample and population. The former refers to the group of individuals actually tested. The latter designates the larger, but similarly constituted, group from which the sample is drawn. For example, if we wish to establish norms of test performance for the population of 10-year-old, urban, public school boys, we might test a carefully chosen sample of 500 10-year-old boys attending public schools in several American cities. The sample would be checked with reference to geographical distribution, socioeconomic level, ethnic composition, and other relevant characteristics to ensure that it was truly representative of the defined population.

In the development and application of test norms, considerable attention should be given to the standardization sample. It is apparent that the sample on which the norms are based should be large enough to provide stable values. Another, similarly chosen sample of the same population should not yield norms that diverge appreciably from those obtained.
"Prillciplesof Psychological Testing
, with a large sampling error would obviollsly be of little yalue in
~erpretationof test scores. NATION~L ANCHOR NORMS. One solution for the lack of comparability
uallyimportant is the requirement that the sample be representative of n~rms IS to use an anchor test to work out eqUivalency tables for scores
',population under consideration. Subtle selective factors that might ?n dl~erent tests. Such tables are designed to show what score in Test A
. the sample unrepresentative should be carefully investigated. A IS e~Ulvalent to ~ach score in TestB. This can be done by the equiper-
ber of such selective factors are illustrated in institutional samples. cent,ze m.ethod, m which scores are considered equivalent when ther
ausesuch samples are usually large and readily available for testing have equal percentiles in a given group. For example, if the 80th pel:'
oses,they offer an alluring field for the accumulation of normative centile in the same group corresponds to an IQ of lI5 on Test A and to
. The special limitations of these samples, however, should be care- an IQ of 120 on Test B, then Test.A-IQ 115 is considered to be equivalent
yanalyzed. Testing subjects in school, for example, will yield an in- to Test-B-IQ 120. This approach has been followed to a limited extent
'singlysuperior selection of cases in the sllccessive grades, owing to by so~e test publishers, who have prepared equivalency tables for a few
e progressive dropping out of the less able pupils. Nor does such of theIr Own tests (see, e.g., Lennon, 1966a).
iffiinationi?,ffectdifferent subgroups equally. For example, the rate of More ambitious proposals have been made from time to time for cali.
ctiveelimination from school is greater for boys than for girls, and brat~n~ each new test against a single anchor test, which has itself been
/~greater in lower than in higher socioeconomic levels. admllllstered to a highly representative, national normative sample (Len-
S~I~ctivefactors likewise operate in other institutional samples, such ~on, 1966b). No single anchor test, of course. could be used in establish-
.prisoners,patients in mental hospitals, or institutionalized mental re- mg norms for all tests, regardless of content. "'hat is required is a batterY
dates.Because of many special factors that determine institutionaliza- of anchor tests, all administered to the same national sample. Each ne,~'
'n itseH,such groups are not representative of the entire population of ~est could then be checked aKainst the most nearlY similar anchor test
riminals,psychotics, or mental retardates. For example, mental retard- 111 the battery. .
tes with physical handicaps are more likely to be institutionalized than The data gathered in Project TALENT (Flanagan et a!', 1964) so far
re the physically fit. Similarly, the relative proportion of severely re- come closest to providing such an anchor batten' for a high school popu-
arded persons will be much greater in institutiunal samples than in the la~ion. Using a r~ndo~ sample of about 5 per~nt of the high schools in
total population. tIllS country, th~ lllVeStIga.torsadministered a two-day battery of specially
Closely related to the question of representativeness of sample is the cons~ructed aphtude, achIevement, interest, and temperament tests to ap-
needfor defining the specific population to which the norms apply. Obvi- pr~:llnately 400,000 students in grades 9 through 12. Even with the avail-
ous]y,one way of ensuring that a sample is representative is to restrict ~bihty of anchor data such as these, however, it must be recognized tItat
the population to fit the ~ecifications of the available sample. For ex- l~dependen~ly dev.eloped tests ·can ~ever be regarded as completely inter-
. ample, if the population i$ defined to include only 14-year-old school- changeable. At best, the use of natIOnal anchor norms would appreciably
chDdrenrather than all 14-year-old children, then a school sample would reduce the lack of comparability among tests, but it would not elimi.
be representative. Ideally, of course, the desired population should be nate it.
definedin advance in terms of the objectives of the test. Then a suitable Th~ Pro!ec~ TALENT battery has been employed to calibrate several
sample should be assembled. Practical obstacles in obtaining subjects, test battenes III use by the Navy and Air Force (Dailey, Shaycoft, & Orr,
however, may make this goal unattainable. In such a case, it is far better 1962: ~haycoft, Neyman, & Dailey, 1962). The general procedure is to
to redefine the population more narrowly than to report norms on an ideal admllllster both the Project TALENT battery and the tests to be cali-
population which is not adequately represented by the standardization bra~ed to the same sample. Through correlational analysis, a ,composite of
sample. In actual practice, very fe''''' tests are standardized on such broad Project TALENT tests is identified that is most n~ya,dycomparable to
populations as is pORularly assumed. No test provides norms for the each test to be norme?. By means of the equipercentile method, tables
human species! And it is doubtful whether any tests give truly adequate are then prepared g1Vlllg the corresponding scores On the Project
norms for such broadly defined populations as "adult American men," T~LENT composite and on the particular test. For several other bat-
"lO-year-old American children," and the like. Consequently, the samples tenes, data have been gathered to identify the Project TA.Lf:NT com-
obtained by different test constructors often tend to be unrepresentative
of their alleged populations and biased in different ways. Hence, the 4 F~r an excellent analysis of some of the technical difficulties involved in efforts
rr<llJtin~norms are not comparable. to achIeve score comparability with different tests, see Angolf (i~~. 1966, 1971a).
"~-
Norms alld the Intcrpretation of Tcst Scores 93
,~ Principles of Psychological Testing
The Project TALENT battery has been employed to calibrate several test batteries in use by the Navy and Air Force (Dailey, Shaycoft, & Orr, 1962; Shaycoft, Neyman, & Dailey, 1962). The general procedure is to administer both the Project TALENT battery and the tests to be calibrated to the same sample. Through correlational analysis, a composite of Project TALENT tests is identified that is most nearly comparable to each test to be normed. By means of the equipercentile method, tables are then prepared giving the corresponding scores on the Project TALENT composite and on the particular test. For several other batteries, data have been gathered to identify the Project TALENT composite corresponding to each test in the battery (Cooley, 1965; Cooley & Miller, 1965). These batteries include the General Aptitude Test Battery of the United States Employment Service, the Differential Aptitude Tests, and the Flanagan Aptitude Classification Tests.

Of particular interest is The Anchor Test Study conducted by the Educational Testing Service under the auspices of the U.S. Office of Education (Jaeger, 1973). This study represents a systematic effort to provide comparable and truly representative national norms for the seven most widely used reading achievement tests for elementary schoolchildren. Through an unusually well-controlled experimental design, over 300,000 fourth-, fifth-, and sixth-grade schoolchildren were examined in 50 states. The anchor test consisted of the reading comprehension and vocabulary subtests of the Metropolitan Achievement Test, for which new norms were established in one phase of the project. In the equating phase of the study, each child took the reading comprehension and vocabulary subtests from two of the seven batteries, each battery being paired in turn with every other battery. Some groups took parallel forms of the two subtests from the same battery. In still other groups, all the pairings were duplicated in reverse sequence, in order to control for order of administration. From statistical analyses of all these data, score equivalency tables for the seven tests were prepared by the equipercentile method. A manual for interpreting scores is provided for use by school systems and other interested persons (Loret, Seder, Bianchini, & Vale, 1974).
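The logic of equipercentile equating is easy to demonstrate. The sketch below (Python; the data and the simple rank-based percentile definition are illustrative assumptions, not the procedure of any of the studies cited above) pairs the scores on two tests, taken by the same group, that cut off equal percentages of cases:

    import random

    def equipercentile_table(scores_a, scores_b, percentiles=range(10, 100, 10)):
        # Scores with equal percentiles in the same group are treated
        # as equivalent (decile points by default).
        a, b = sorted(scores_a), sorted(scores_b)
        rows = []
        for p in percentiles:
            i_a = min(len(a) - 1, int(p * len(a) / 100))
            i_b = min(len(b) - 1, int(p * len(b) / 100))
            rows.append((p, a[i_a], b[i_b]))
        return rows

    # Illustrative data only: the same 1,000 examinees take both tests,
    # one reporting IQ's with SD 16 and the other with SD 12.
    random.seed(1)
    group = [random.gauss(0, 1) for _ in range(1000)]
    test_a = [round(100 + 16 * z) for z in group]
    test_b = [round(100 + 12 * z) for z in group]
    for p, sa, sb in equipercentile_table(test_a, test_b):
        print("%dth percentile: Test A %d ~ Test B %d" % (p, sa, sb))

In this artificial example, an IQ of about 116 on Test A pairs with an IQ of about 112 on Test B at the 84th percentile, echoing the earlier point that IQ's from scales with different SD's are not interchangeable.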
SPECIFIC NORMS. Another approach to the nonequivalence of existing norms-and probably a more realistic one for most tests-is to standardize tests on more narrowly defined populations, so chosen as to suit the specific purposes of each test. In such cases, the limits of the normative population should be clearly reported with the norms. Thus, the norms might be said to apply to "employed clerical workers in large business organizations" or to "first-year engineering students." For many testing purposes, highly specific norms are desirable. Even when representative norms are available for a broadly defined population, it is often helpful to have separately reported subgroup norms. This is true whenever recognizable subgroups yield appreciably different scores on a particular test. The subgroups may be formed with respect to age, grade, type of curriculum, sex, geographical region, urban or rural environment, socioeconomic level, and many other factors. The use to be made of the test determines the type of differentiation that is most relevant, as well as whether general or specific norms are more appropriate.

Mention should also be made of local norms, often developed by the test users themselves within a particular setting. The groups employed in deriving such norms are even more narrowly defined than the subgroups considered above. Thus, an employer may accumulate norms on applicants for a given type of job within his company. A college admissions office may develop norms on its own student population. Or a single elementary school may evaluate the performance of individual pupils in terms of its own score distribution. These local norms are more appropriate than broad national norms for many testing purposes, such as the prediction of subsequent job performance or college achievement, the comparison of a child's relative achievement in different subjects, or the measurement of an individual's progress over time.

FIXED REFERENCE GROUP. Although most derived scores are computed in such a way as to provide an immediate normative interpretation of test performance, there are some notable exceptions. One type of nonnormative scale utilizes a fixed reference group in order to ensure comparability and continuity of scores, without providing normative evaluation of performance. With such a scale, normative interpretation requires reference to independently collected norms from a suitable population. Local or other specific norms are often used for this purpose.

One of the clearest examples of scaling in terms of a fixed reference group is provided by the score scale of the College Board Scholastic Aptitude Test (Angoff, 1962, 1971b). Between 1926 (when this test was first administered) and 1941, SAT scores were expressed on a normative scale, in terms of the mean and SD of the candidates taking the test at each administration. As the number and variety of College Board member colleges increased and the composition of the candidate population changed, it was concluded that scale continuity should be maintained. Otherwise, an individual's score would depend on the characteristics of the group tested during a particular year. An even more urgent reason for scale continuity stemmed from the observation that students taking the SAT at certain times of the year performed more poorly than those taking it at other times, owing to the differential operation of selective factors. After 1941, therefore, all SAT scores were expressed in terms of the mean and SD of the approximately 11,000 candidates who took the test in 1941. These candidates constitute the fixed reference group employed in scaling all subsequent forms of the test. Thus, a score of 500 on any form of the SAT corresponds to the mean of the 1941 sample; a score of 600 falls 1 SD above that mean, and so forth.

To permit translation of raw scores on any form of the SAT into these fixed-reference-group scores, a short anchor test (or set of common items) is included in each form. Each new form is thereby linked to one or two earlier forms, which in turn are linked with other forms by a chain of items extending back to the 1941 form. These nonnormative SAT scores can then be interpreted by comparison with any appropriate distribution of scores, such as that of a particular college, a type of college, a region, etc. These specific norms are more useful in making college admission decisions than would be annual norms based on the entire candidate population. Any changes in the candidate population over time, moreover, can be detected only with a fixed-score scale. It will be noted that the principal difference between the fixed-reference-group scales under consideration and the previously discussed scales based on national anchor norms is that the latter require the choice of a single group that is broadly representative and appropriate for normative purposes. Apart from the practical difficulties in obtaining such a group and the need to update the norms, it is likely that for many testing purposes such broad norms are not required.

Scales built from a fixed reference group are analogous in one respect to scales employed in physical measurement. In this connection, Angoff (1962, pp. 32-33) writes:

There is hardly a person here who knows the precise original definition of the length of the foot used in the measurement of height or distance, or which king it was whose foot was originally agreed upon as the standard; on the other hand, there is no one here who does not know how to evaluate lengths and distances in terms of this unit. Our ignorance of the precise original meaning or derivation of the foot does not lessen its usefulness to us in any way. Its usefulness derives from the fact that it remains the same over time and allows us to familiarize ourselves with it. Needless to say, precisely the same considerations apply to other units of measurement-the inch, the mile, the degree of Fahrenheit, and so on. In the field of psychological measurement it is similarly reasonable to say that the original definition of the scale is or should be of no consequence. What is of consequence is the maintenance of a constant scale-which, in the case of a multiple-form testing program, is achieved by rigorous form-to-form equating-and the provision of supplementary normative data to aid in interpretation and in the formation of specific decisions, data which would be revised from time to time as conditions warrant.
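The reporting step of such a scale reduces to a single linear transformation once the anchor-item chain has done its work. The sketch below is a simplified illustration under stated assumptions, not the College Board's actual procedure: it supposes that equating has already estimated the raw-score mean and SD that the 1941 reference group would obtain on the current form, and then expresses any raw score on the fixed 500/100 scale:

    def to_fixed_scale(raw, ref_mean, ref_sd):
        # ref_mean and ref_sd are assumed estimates, via the chain of
        # common items, of how the 1941 reference group would score on
        # this particular form.
        return 500.0 + 100.0 * (raw - ref_mean) / ref_sd

    # If equating indicates the reference group would average 47 raw
    # points (SD 12) on this form, a raw score of 59 reports as 600:
    print(to_fixed_scale(59, ref_mean=47.0, ref_sd=12.0))

Note that nothing in this transformation describes the current candidate group; that is precisely what makes the resulting scores nonnormative and stable across years.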
COMPUTER UTILIZATION IN THE INTERPRETATION OF TEST SCORES

Computers have already made a significant impact upon every phase of testing, from test construction to administration, scoring, reporting, and interpretation. The obvious uses of computers-and those developed earliest-represent simply an unprecedented increase in the speed with which traditional data analyses and scoring processes can be carried out. Far more important, however, are the adoption of new procedures and the exploration of new approaches to psychological testing which would have been impossible without the flexibility, speed, and data-processing capabilities of computers. As Baker (1971, p. 227) succinctly puts it, computer capabilities should serve "to free one's thinking from the constraints of the past."

Various testing innovations resulting from computer utilization will be discussed under appropriate topics throughout the book. In the present connection, we shall examine some applications of computers in the interpretation of test scores. At the simplest level, most current tests, and especially those designed for group administration, are now adapted for computer scoring (Baker, 1971). Several test publishers, as well as independent test-scoring organizations, are equipped to provide such scoring services to test users. Although separate answer sheets are commonly used for this purpose, optical scanning equipment available at some scoring centers permits the reading of responses directly from test booklets. Many innovative possibilities, such as diagnostic scoring and path analysis (recording a student's progress at various stages of learning), have barely been explored.

At a somewhat more complex level, certain tests now provide facilities for computer interpretation of test scores. In such cases, the computer program associates prepared verbal statements with particular patterns of test responses. This approach has been pursued with both personality and aptitude tests. For example, with the Minnesota Multiphasic Personality Inventory (MMPI), to be discussed in Chapter 17, test users may obtain computer printouts of diagnostic and interpretive statements about the subject's personality tendencies and emotional condition, together with the numerical scores. Similarly, the Differential Aptitude Tests (see Ch. 13) provide a Career Planning Report, which includes a profile of scores on the separate subtests as well as an interpretive computer printout. The latter contains verbal statements that combine the test data with information on interests and goals given by the student on a Career Planning Questionnaire. These statements are typical of what a counselor would say to the student in going over his test results in an individual conference (Super, 1973).

Individualized interpretation of test scores at a still more complex level is illustrated by interactive computer systems, in which the individual is in direct contact with the computer by means of response stations and in effect engages in a dialogue with the computer (J. A. Harris, 1973; Holtzman, 1970; M. R. Katz, 1974; Super, 1970). This technique has been investigated with regard to educational and vocational planning and decision making. In such a situation, test scores are usually incorporated in the computer data base, together with other information provided by the student or client. Essentially, the computer combines all the available information about the individual with stored data about educational programs and occupations; and it utilizes all relevant facts and relations in answering the individual's questions and aiding him in reaching decisions. Examples of such interactive computer systems, in various stages of operational development, include IBM's Education and Career Exploration System (ECES) and the System for Interactive Guidance and Information (SIGI). Preliminary field trials show good acceptance of such systems by high school students and their parents (Harris, 1973).
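The mechanism behind such interpretive printouts is essentially a lookup from score patterns to prepared statements. The toy sketch below (Python; the rules, scale names, and wording are invented for illustration and do not reproduce any publisher's actual system) shows the idea:

    def interpret(scores, rules):
        # scores: dict of scale name -> standard score.
        # rules: list of (predicate, statement) pairs, checked in order;
        # every statement whose pattern matches is included in the report.
        return [statement for predicate, statement in rules if predicate(scores)]

    rules = [
        (lambda s: s["verbal"] >= 60 and s["numerical"] >= 60,
         "Scores are well above average in both verbal and numerical areas."),
        (lambda s: s["verbal"] - s["numerical"] >= 15,
         "Verbal development is markedly stronger than numerical."),
        (lambda s: s["numerical"] < 40,
         "Numerical skills may need review before advanced coursework."),
    ]

    for line in interpret({"verbal": 68, "numerical": 45}, rules):
        print(line)

Operational systems are, of course, far more elaborate, combining many scales, background information, and empirically derived decision rules, but the pattern-to-statement association is the common core.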
Test results also represent an integral part of the data utilized in computer-assisted instruction (CAI).5 In order to present instructional material appropriate to each student's current level of attainment, the computer must repeatedly score and evaluate the student's responses to the preceding material. On the basis of his response history, the student may move on to more advanced material, to further practice at the present level, or to a remedial branch in which he receives instruction in more elementary prerequisite material. Diagnostic analysis of errors may lead to an instructional program designed to correct the specific learning difficulties identified in individual cases.

A less costly and operationally more feasible variant of computer utilization in learning is computer-managed instruction (CMI-see Hambleton, 1974). In such systems, the learner does not interact directly with the computer. The function of the computer is to assist the teacher in individualizing instruction, which may utilize instruction packages or more conventional procedures. The contribution of the computer is to process the rather formidable mass of data accumulated daily regarding the performance of each student in a program where each learner may be involved in a different activity, and to use these data in prescribing the next instructional step for each student. Examples of this application of computers are provided by the University of Pittsburgh's IPI (Individually Prescribed Instruction-see Cooley & Glaser, 1969; Glaser, 1968) and Project PLAN (Planning for Learning in Accordance with Needs) developed by the American Institutes for Research (Flanagan, 1971; Flanagan, Shanner, Brudner, & Marker, 1975). Project PLAN includes a program of self-knowledge, individual development, and occupational planning, as well as instruction in elementary and high school subjects.

5 For a description of a widely used CAI system for teaching reading in the early grades, see R. C. Atkinson (1974).

CRITERION-REFERENCED TESTING

NATURE AND USES. An approach to testing that has aroused a surge of activity, particularly in education, is generally designated as "criterion-referenced testing." First proposed by Glaser (1963), this term is still somewhat loosely used, and its definition varies among different writers. Moreover, several alternative terms are in common use, such as content-, domain-, and objective-referenced. These terms are sometimes employed as synonyms for criterion-referenced and sometimes with slightly different connotations. "Criterion-referenced," however, seems to have gained ascendancy, although it is not the most appropriate term.

Typically, criterion-referenced testing uses as its interpretive frame of reference a specified content domain rather than a specified population of persons. In this respect, it has been contrasted with the usual norm-referenced testing, in which an individual's score is interpreted by comparing it with the scores obtained by others on the same test. In criterion-referenced testing, for example, an examinee's test performance may be reported in terms of the specific kinds of arithmetic operations he has mastered, the estimated size of his vocabulary, the difficulty level of reading matter he can comprehend (from comic books to literary classics), or the chances of his achieving a designated performance level on an external criterion (educational or vocational).

Thus far, criterion-referenced testing has found its major applications in several recent innovations in education. Prominent among these are computer-assisted, computer-managed, and other individualized, self-paced instructional systems. In all these systems, testing is closely integrated with instruction, being introduced before, during, and after completion of each instructional unit to check on prerequisite skills, diagnose possible learning difficulties, and prescribe subsequent instructional procedures. The previously cited Project PLAN and IPI are examples of such programs.

From another angle, criterion-referenced tests are useful in broad surveys of educational accomplishment, such as the National Assessment of Educational Progress (Womer, 1970), and in meeting demands for educational accountability (Gronlund, 1974). From still another angle, testing for the attainment of minimum requirements, as in qualifying for a driver's license or a pilot's license, illustrates criterion-referenced testing. Finally, familiarity with the concepts of criterion-referenced testing can contribute to the improvement of the traditional, informal tests prepared by teachers for classroom use. Gronlund (1973) provides a helpful guide for this purpose, as well as a simple and well-balanced introduction to criterion-referenced testing. A brief but excellent discussion of the chief limitations of criterion-referenced tests is given by Ebel (1972b).

CONTENT MEANING. The major distinguishing feature of criterion-referenced testing (however defined and whether designated by this term or by one of its synonyms) is its interpretation of test performance in terms of content meaning. The focus is clearly on what the person can do and what he knows, not on how he compares with others.
1 " Norms and tIle Interpretation of Test Scores 99


rinciplrs of Psychological T ('sting
indiVidual. has ~r has not attained the preestablished level of mastery .
equirement in constructing this type of test is a. clearly defined When basic skIlls are tested, nearly complete mastery is generally ex-
. f knowledge or skills to be assesscd by the test. If scores. on such pected (e.g., 80--85% correct items). A three-way distinction may also
e to have communicable meaning, the content domam to be be employed, including mastery, nonmastery, and an intermediate doubt-
~lust be widely recognized as important. The selected domain ful, or "review" interval. '
subdivided into small units defined in performance terms. In connection with individualized instru('tion, some educators have
llciHQIlUI context these units correspond to behaviorally defined argued that, given enough time and suitable instructional methods nearly
6nal~.bjectives, 'such as "multiplies three-digit by two-digit ~veryone can achieve complete mastery of the chosen instructio~al ob-
•.or "identifies the misspelled word in which the final e is re- J:etives. Individ~al differences would thus be manifested in learning
,hen addl~g -ing." In the programs prepared for in?ividualized hme rather than In final achievement as in traditional educational testing
ion; these objectives run to several hundred for a smgle school (Bloom, 1968; J. B. C~rroll, 1963, 1970; Cooley & Glaser, 1969; Gagne,
.~Afterthe instructional objectives have been fonnulated, items are 1965). It follows t.hat In mastery testing, individual differences in per-
d to sample each objective. This procedure is admittedly difficult fo~m~nce are of httle or no interest. Hence as generally constructed
, e -consuming. \Vithout such careful specification and control of cnter~on-refer~nced tests minimize indh'idual differences. For example:
..t, however, the results of criterion-referenced testing could de- they lnclude items passed or failed by all or nearly all examinees al-
rite into an idiosyncratic and uninterpretable jumble. though such. ite~ns are usually excluded from no~n-referenced t~sts.'
,en strictly applied, criterion-referenced testing is best adapted for Mas:er~ t.estin? IS r~gularly. employed in the previously cited programs
ng basic skills (as in reading and arithmetic) at elem~ntary le~e1s. fo~ l~dlvlduahzed mstructIon. It is also characteristic of published
heseareas, insh'uctional objectives can also be arranged m an ordmal cr~tenon-referenced tes~ for basic skills, suitable for elementary school.
archy, the acquisition of more elementary skills being prerequisite Exam~le~ of such tes~ mclude the Prescriptive Reading Inventory and
:the acquisition of higher-level skills.6 It is impr~eticab~e a?d probably Pres~np~lve Mathem~tlCsJnventory (California Test Bureau), The Skills
ndesirable, however, to formulate highly speCIfic obJectIves for ad- M:omtor~ng System in Reading and in Study Skills (Harcourt Brace
vancedlevels of howl edge in less highly structured subjects. At these ! o\'anovlch) '. and ~iagnosis: An Instructi onal Aid Series in Reading and
',ievels,both thc content and sequence of learning are likely to be much In Mathematics (ScLCnceResearch Associates).
'moreflexible. Beyond basic skills, mastery testing is inapplicable or insufficient. In
On the other hand, in its emphasis on content meaning in the interpre- more. ad~'~nced and less structured subjects, achievement is open-ended.
tation of test scores, criterion-referenced testing may exert a salutary The ll1dlvJ~ual m~~ progress almost without limit in such functions as
effecton testing in general. The interpretation of intelligence test scores, understandmg, cnbcal thinking, appreciation, and originality. Moreover,
_,for example, would benefit from this approach. To describe a child's content ~vel:a~e m~y p~oc~ed in many different directions, depending
" intelligence test performance in terms of the specific intelJech~al skills upon .the mdl~I~~al s abllibes, interests, and goals, as well as local in-
and knowledge it represents might help to counteract the confuSIOns a~d r . factllties. Under these conditions ' complete ma St ery IS
structional . un-
misconceptions that have become attached .to the IQ. VVhen stated I~ rea lStiCan.d unnecessary. Hence norm-referenced evaluation is generally
these general terms, however, the critenon-referenced approa~h IS enlployed In such cases to assess degree of attainment. Some published
equivalent to interpreting test sCOTesin t~e light of the demonstra~ed tcsts are so constructed as to permit both norm-referenced and criterion-
validity of the particular test, rather than m terms of vague underlymg refe~enced applications. An example is the 1973 Edition of the Stanford
entities. Such an interpretation can certainly be combined with n?rm- AchIevement Test. While providing appropriate norms at each level this
referenced scores. batt~ry ~eets three important requirements of criterion-referenced ;ests:
speclflc~tlO~ of ~etailed instructional objectives, adequate coverage of
each obJective WIth appropriate items, and wide range of item difficulty,
MASTERY TESTING. A second major feature almost always found in It should be noted that criterion-referenced testing is neither as ne~'
criterion-referenced testing is the procedure of testing for mastery. Es-
sentiany, this procedure yields an all-or-none score, indicating that the . : As a resl~lt.of this reduction in variability, the usual methods for findin tdtlJio
~,hty and \'al,d'.ty are,inapplkahle to most criterion-referenced tests. FurtherirSCIlIl.
6ldeaUy, such tests follow the simplex model of a Guttman scale (see Popham & sum of these pomts Willbe found in Chapters 5, S, and 8.
1T1Isck,] 9(9), as do the PiaF:etian ordinal scales discussed earlier in this chapter.
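The all-or-none mastery decision, with its intermediate "review" band, amounts to a simple cutoff rule on the proportion of items answered correctly. The sketch below (Python; the cutoff values are assumptions drawn from the 80-85 percent figure mentioned above, not prescribed standards) makes the rule explicit:

    def mastery_status(n_correct, n_items, mastery_cut=0.85, review_cut=0.80):
        # Classify a score as mastery, review (intermediate, doubtful),
        # or nonmastery, using assumed proportion-correct cutoffs.
        p = n_correct / n_items
        if p >= mastery_cut:
            return "mastery"
        if p >= review_cut:
            return "review"
        return "nonmastery"

    print(mastery_status(18, 20))   # mastery (90 percent correct)
    print(mastery_status(16, 20))   # review (80 percent correct)
    print(mastery_status(12, 20))   # nonmastery (60 percent correct)

Note how the rule throws away all information about individual differences above and below the cutoffs, which is exactly the property discussed in the surrounding text.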
It should be noted that criterion-referenced testing is neither as new nor as clearly divorced from norm-referenced testing as some of its proponents imply. Evaluating an individual's test performance in absolute terms, such as by letter grades or percentage of correct items, is certainly older than normative interpretations. More precise attempts to interpret test performance in terms of content meaning also antedate the introduction of the term "criterion-referenced testing" (Ebel, 1962, and others-see also Anastasi, 1968, pp. 69-70). Other examples may be found in early product scales for assessing the quality of handwriting, compositions, or drawings by matching the individual's work sample against a set of standard specimens. Ebel (1972b) observes, furthermore, that the concept of mastery in education-in the sense of all-or-none learning of specific units-achieved considerable popularity in the 1920s and 1930s and was later abandoned.

A normative framework is implicit in all testing, regardless of how scores are expressed (Angoff, 1974). The very choice of content or skills to be measured is influenced by the examiner's knowledge of what can be expected from human organisms at a particular developmental or instructional stage. Such a choice presupposes information about what other persons have done in similar situations. Moreover, by imposing uniform cutoff scores on an ability continuum, mastery testing does not thereby eliminate individual differences. To describe an individual's level of reading comprehension as "the ability to understand the content of the New York Times" still leaves room for a wide range of individual differences in degree of understanding.

EXPECTANCY TABLES. Test scores may also be interpreted in terms of expected criterion performance, as in a training program or on a job. This usage of the term "criterion" follows standard psychometric practice, as when a test is said to be validated against a particular criterion (see Ch. 2). Strictly speaking, the term "criterion-referenced testing" should refer to this type of performance interpretation, while the other approaches discussed in this section can be more precisely described as content-referenced. This terminology, in fact, is used in the APA test Standards (1974).

An expectancy table gives the probability of different criterion outcomes for persons who obtain each test score. For example, if a student obtains a score of 530 on the CEEB Scholastic Aptitude Test, what are the chances that his freshman grade-point average in a specific college will fall in the A, B, C, D, or F category? This type of information can be obtained by examining the bivariate distribution of predictor scores (SAT) plotted against criterion status (freshman grade-point average). If the number of cases in each cell of such a bivariate distribution is changed to a percentage, the result is an expectancy table, such as the one illustrated in Table 6. The data for this table were obtained from 171 high school boys enrolled in courses in American history. The predictor was the Verbal Reasoning test of the Differential Aptitude Tests, administered early in the course. The criterion was end-of-course grades. The correlation between test scores and criterion was .66.

TABLE 6
Expectancy Table Showing Relation between DAT Verbal Reasoning Test and Course Grades in American History for 171 Boys in Grade 11
(Adapted from Fifth Edition Manual for the Differential Aptitude Tests, Forms S and T, p. 118. Reproduced by permission. Copyright © 1973, 1974 by The Psychological Corporation, New York, N.Y. All rights reserved.)

                          Percentage Receiving Each Criterion Grade
Test         Number
Score        of Cases   Below 70   70-79   80-89   90 & above
40 & above      46                   15      22        63
30-39           36          6        39      39        17
20-29           43         12        63      21         5
Below 20        46         30        52      17

The first column of Table 6 shows the test scores, divided into four class intervals; the number of students whose scores fall into each interval is given in the second column. The remaining entries in each row of the table indicate the percentage of cases within each test-score interval who received each grade at the end of the course. Thus, of the 46 students with scores of 40 or above on the Verbal Reasoning test, 15 percent received grades of 70-79, 22 percent grades of 80-89, and 63 percent grades of 90 or above. At the other extreme, of the 46 students scoring below 20 on the test, 30 percent received grades below 70, 52 percent between 70 and 79, and 17 percent between 80 and 89. Within the limitations of the available data, these percentages represent the best estimates of the probability that an individual will receive a given criterion grade. For example, if a new student receives a test score of 34 (i.e., in the 30-39 interval), we would conclude that the probability of his obtaining a grade of 90 or above is 17 out of 100; the probability of his obtaining a grade between 80 and 89 is 39 out of 100, and so on.

In many practical situations, criteria can be dichotomized into "success" and "failure" in a job, a course of study, or other undertaking. Under these conditions, an expectancy chart can be prepared, showing the probability of success or failure corresponding to each score interval.
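Constructing a table like Table 6 is a matter of tallying the bivariate distribution and converting each row of cell counts to percentages. The sketch below (Python; the banding functions and the sample data are invented for illustration and are not the DAT sample) carries out those two steps:

    from collections import Counter, defaultdict

    def expectancy_table(pairs, score_band, grade_band):
        # Tally the bivariate (score, grade) distribution, then convert
        # each row of counts to percentages of the cases in that band.
        rows = defaultdict(Counter)
        for score, grade in pairs:
            rows[score_band(score)][grade_band(grade)] += 1
        table = {}
        for band, cells in rows.items():
            n = sum(cells.values())
            table[band] = (n, {g: round(100.0 * c / n) for g, c in cells.items()})
        return table

    def score_band(s):
        return "40 & above" if s >= 40 else "30-39" if s >= 30 else \
               "20-29" if s >= 20 else "Below 20"

    def grade_band(g):
        return "90 & above" if g >= 90 else "80-89" if g >= 80 else \
               "70-79" if g >= 70 else "Below 70"

    # Made-up (test score, course grade) pairs for illustration:
    sample = [(42, 93), (41, 86), (35, 85), (33, 91), (27, 74),
              (24, 68), (18, 66), (15, 72)]
    for band, (n, pcts) in expectancy_table(sample, score_band, grade_band).items():
        print(band, n, pcts)

With a realistically large sample, the row percentages serve as the empirical probability estimates described in the preceding paragraph.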
Figure 7 is an example of such an expectancy chart. Based on a pilot selection battery developed by the Air Force, this expectancy chart shows the percentage of men scoring within each stanine on the battery who failed to complete primary flight training.

FIG. 7. Expectancy Chart Showing Relation between Performance on Selection Battery and Elimination from Primary Flight Training. (From Flanagan, 1947, p. 58.) [The chart gives, for each stanine from 9 down to 1, the number of men tested: 21,474; 19,444; 32,129; 39,398; 34,975; 23,699; 11,209; 2,139; and 904, together with the percentage eliminated at each stanine.]

It can be seen that 77 percent of the men receiving a stanine of 1 were eliminated in the course of training, while only 4 percent of those at stanine 9 failed to complete the training satisfactorily. Between these extremes, the percentage of failures decreases consistently over the successive stanines. On the basis of this expectancy chart, it could be predicted, for example, that approximately 40 percent of pilot cadets who obtain a stanine score of 4 will fail and approximately 60 percent will satisfactorily complete primary flight training. Similar statements regarding the probability of success and failure could be made about individuals who receive each stanine. Thus, an individual with a stanine of 4 has a 60:40 or 3:2 chance of completing primary flight training. Besides providing a criterion-referenced interpretation of test scores, it can be seen that both expectancy tables and expectancy charts give a general idea of the validity of a test in predicting a given criterion.

CHAPTER 5

Reliability

RELIABILITY refers to the consistency of scores obtained by the same persons when reexamined with the same test on different occasions, or with different sets of equivalent items, or under other variable examining conditions. This concept of reliability underlies the computation of the error of measurement of a single score, whereby we can predict the range of fluctuation likely to occur in a single individual's score as a result of irrelevant, chance factors.

The concept of test reliability has been used to cover several aspects of score consistency. In its broadest sense, test reliability indicates the extent to which individual differences in test scores are attributable to "true" differences in the characteristics under consideration and the extent to which they are attributable to chance errors. To put it in more technical terms, measures of test reliability make it possible to estimate what proportion of the total variance of test scores is error variance. The crux of the matter, however, lies in the definition of error variance. Factors that might be considered error variance for one purpose would be classified under true variance for another. For example, if we are interested in measuring fluctuations of mood, then the day-by-day changes in scores on a test of cheerfulness-depression would be relevant to the purpose of the test and would hence be part of the true variance of the scores. If, on the other hand, the test is designed to measure more permanent personality characteristics, the same daily fluctuations would fall under the heading of error variance.

Essentially, any condition that is irrelevant to the purpose of the test represents error variance. Thus, when the examiner tries to maintain uniform testing conditions by controlling the testing environment, instructions, time limits, rapport, and other similar factors, he is reducing error variance and making the test scores more reliable. Despite optimum testing conditions, however, no test is a perfectly reliable instrument. Hence, every test should be accompanied by a statement of its reliability. Such a measure of reliability characterizes the test when administered under standard conditions and given to subjects similar to those constituting the normative sample. The characteristics of this sample should therefore be specified, together with the type of reliability that was measured.
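The partition of score variance into true and error components can be illustrated numerically. In the brief sketch below (Python; the reliability coefficient and SD are assumed values chosen only for illustration), the reliability coefficient is read directly as the proportion of total variance that is true variance:

    reliability = 0.85              # assumed reliability coefficient
    total_variance = 16.0 ** 2      # e.g., scores reported with SD = 16
    true_variance = reliability * total_variance
    error_variance = (1.0 - reliability) * total_variance
    print(true_variance, error_variance)   # 217.6 and 38.4

In other words, under these assumed figures, an estimated 85 percent of the score variance reflects "true" individual differences and the remaining 15 percent is attributable to chance error.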
There could, of course, be as many varieties of test reliability as there are conditions affecting test scores, since any such conditions might be irrelevant for a certain purpose and would thus be classified as error variance. The types of reliability computed in actual practice, however, are relatively few. In this chapter, the principal techniques for measuring the reliability of test scores will be examined, together with the sources of error variance identified by each. Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient. Accordingly, the next section will consider some of the basic characteristics of correlation coefficients, in order to clarify their use and interpretation. More technical discussion of correlation, as well as more detailed specifications of computing procedures, can be found in any elementary textbook of educational or psychological statistics, such as Guilford and Fruchter (1973).
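Although the full computing procedures are left to such texts, the basic formula is compact enough to state here. The sketch below (Python; illustrative data, standard product-moment formula) computes r as the sum of cross-products of deviation scores divided by the product of the two sums of squared deviations' square roots:

    import math

    def pearson_r(x, y):
        # Product-moment correlation between two sets of paired scores.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = math.sqrt(sum((a - mx) ** 2 for a in x))
        sy = math.sqrt(sum((b - my) ** 2 for b in y))
        return cov / (sx * sy)

    # A perfect positive correlation, as in Figure 8:
    print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))   # 1.0
    # A perfect negative correlation, as in Figure 9:
    print(pearson_r([1, 2, 3, 4], [40, 30, 20, 10]))   # -1.0

The two extreme cases printed above correspond to the scatter diagrams discussed in the following section.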
MEANING OF CORRELATION. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in Figure 8. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in this diagram indicates the score of one individual in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along the diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each individual occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.

FIG. 8. Bivariate Distribution for a Hypothetical Correlation of +1.00. (Scatter diagram of tally marks, with Score on Variable 1 on the horizontal axis and Score on Variable 2 on the vertical axis.)

Figure 9 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other. The best individual in variable 1 is the poorest in variable 2 and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all individuals fall on the diagonal extending from the upper left- to the lower right-hand corner. This diagonal runs in the reverse direction from that in Figure 8.

A zero correlation indicates complete absence of relationship, such as might occur by chance. If each individual's name were pulled at random out of a hat to determine his position in variable 1, and if the process were repeated for variable 2, a zero or near-zero correlation would result. Under these conditions, it would be impossible to predict an individual's relative standing in variable 2 from a knowledge of his score in variable 1. The top-scoring subject in variable 1 might score high, low, or average in variable 2. Some individuals might by chance score above average in both variables, or below average in both; others might fall above average in one variable and below in the other; still others might be above the average in one and at the average in the second, and so forth. There would be no regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these extremes, having some value higher than zero but lower than 1.00. Correlations between measures of abilities are nearly always positive, although frequently low. When a negative correlation is obtained between two such variables, it usually results from the way in which the scores are expressed. For example, if time scores are correlated with amount scores, a negative correlation will probably result. Thus, if each subject's score on an arithmetic computation test is recorded as the number of seconds required to complete all items, while his score on an arithmetic reasoning test represents the number of problems correctly solved, a negative correlation can be expected. In such a case, the poorest (i.e., slowest) individ-
CHAPTER 5
No. of

9
Men

21,474 Reliability
S 19,444

7 32,129
[FIG. 7. Expectancy Chart Showing Relation between Performance on Pilot Selection Battery and Elimination from Primary Flight Training. (From Flanagan, 1947, p. 58.)]

The chart shows the percentage of men scoring within each stanine on the battery who failed to complete primary flight training. It can be seen that 77 percent of the men receiving a stanine of 1 were eliminated in the course of training, while only 4 percent of those at stanine 9 failed to complete the training satisfactorily. Between these extremes, the percentage of failures decreases consistently over the successive stanines. On the basis of this expectancy chart, it could be predicted, for example, that approximately 40 percent of pilot cadets who obtain a stanine of 4 will fail and approximately 60 percent will satisfactorily complete primary flight training. Similar statements regarding the probability of success and failure could be made about individuals within each stanine. Thus, an individual with a stanine of 4 has a 60:40 chance of completing primary flight training. Besides providing a criterion-referenced interpretation of test scores, it can be seen that both expectancy tables and expectancy charts give a general idea of the validity of a test in predicting a given criterion.

RELIABILITY

Reliability refers to the consistency of scores obtained by the same persons when reexamined with the same test on different occasions, or with different sets of equivalent items, or under other variable examining conditions. This concept of reliability underlies the computation of the error of measurement of a single score, whereby we can predict the range of fluctuation likely to occur in a single individual's score as a result of irrelevant, chance factors.

The concept of test reliability has been used to cover several aspects of score consistency. In its broadest sense, test reliability indicates the extent to which individual differences in test scores are attributable to "true" differences in the characteristics under consideration and the extent to which they are attributable to chance errors. To put it in more technical terms, measures of test reliability make it possible to estimate what proportion of the total variance of test scores is error variance. The crux of the matter, however, lies in the definition of error variance. Factors that might be considered error variance for one purpose would be classified under true variance for another. For example, if we are interested in measuring fluctuations of mood, then the day-by-day changes in scores on a test of cheerfulness-depression would be relevant to the purpose of the test and would hence be part of the true variance of the scores. If, on the other hand, the test is designed to measure more permanent personality characteristics, the same daily fluctuations would fall under the heading of error variance.

Essentially, any condition that is irrelevant to the purpose of the test represents error variance. Thus, when the examiner tries to maintain uniform testing conditions by controlling the testing environment, instructions, time limits, rapport, and other similar factors, he is reducing error variance and making the test scores more reliable. Despite optimum testing conditions, however, no test is a perfectly reliable instrument. Hence, every test should be accompanied by a statement of its reliability. Such a measure of reliability characterizes the test when administered under standard conditions and given to subjects similar to those constituting the normative sample. The characteristics of this sample should therefore be specified, together with the type of reliability that was measured.
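Stated compactly, this partition of score variance can be written in standard psychometric notation. The following is a sketch of the idea only; the symbols are supplied here and are not the author's own formulas.

```latex
% Total score variance partitioned into true and error components
\sigma_{\text{total}}^{2} = \sigma_{\text{true}}^{2} + \sigma_{\text{error}}^{2}

% Reliability: the proportion of total variance that is true variance
r_{tt} = \frac{\sigma_{\text{true}}^{2}}{\sigma_{\text{total}}^{2}}
       = 1 - \frac{\sigma_{\text{error}}^{2}}{\sigma_{\text{total}}^{2}}
```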
There would, of course, be as many varieties of test reliability as there are conditions affecting test scores, since any such conditions might be irrelevant for a certain purpose and would thus be classified as error variance. The types of reliability computed in actual practice, however, are relatively few. In this chapter, the principal techniques for measuring the reliability of test scores will be examined, together with the sources of error variance identified by each. Since all types of reliability are concerned with the degree of consistency or agreement between two independently derived sets of scores, they can all be expressed in terms of a correlation coefficient. Accordingly, the next section will consider some of the basic characteristics of correlation coefficients, in order to clarify their use and interpretation. More technical discussion of correlation, as well as more detailed specifications of computing procedures, can be found in any elementary textbook of educational or psychological statistics, such as Guilford and Fruchter (1973).
MEANING OF CORRELATION. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in Figure 8. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in this diagram indicates the score of one individual in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along a diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each individual occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.

[FIG. 8. Bivariate Distribution for a Hypothetical Correlation of +1.00.]

Figure 9 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other. The best individual in variable 1 is the poorest in variable 2 and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all individuals fall on the diagonal extending from the upper left- to the lower right-hand corner. This diagonal runs in the reverse direction from that in Figure 8.

A zero correlation indicates complete absence of relationship, such as might occur by chance. If each individual's name were pulled at random out of a hat to determine his position in variable 1, and if the process were repeated for variable 2, a zero or near-zero correlation would result. Under these conditions, it would be impossible to predict an individual's relative standing in variable 2 from knowledge of his score in variable 1. The top-scoring subject in variable 1 might score high, low, or average in variable 2. Some individuals might by chance score above average in both variables or below average in both; others might fall above average in one variable and below in the other; still others might be above the average in one and at the average in the second; and so forth. There would be no regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these extremes, having some value higher than zero but lower than 1.00. Correlations between measures of abilities are nearly always positive, although frequently low. When a negative correlation is obtained between two such variables, it usually results from the way in which the scores are expressed. For example, if time scores are correlated with amount scores, a negative correlation will probably result. Thus, if each subject's score on an arithmetic computation test is recorded as the number of seconds required to complete all items, while his score on an arithmetic reasoning test represents the number of problems correctly solved, a negative correlation can be expected.
In such a case, the poorest (i.e., slowest) individual will have the numerically highest score on the first test, while the best individual will have the highest score on the second.

[FIG. 9. Bivariate Distribution for a Hypothetical Correlation of -1.00.]

Correlation coefficients may be computed in various ways, depending on the nature of the data. The most common is the Pearson Product-Moment Correlation Coefficient. This correlation coefficient takes into account not only the person's position in the group, but also the amount of his deviation above or below the group mean. It will be recalled that when each individual's standing is expressed in terms of standard scores, persons falling above the average receive positive standard scores, while those below the average receive negative scores. Thus, an individual who is superior in both variables to be correlated would have two positive standard scores; one inferior in both would have two negative standard scores. If, now, we multiply each individual's standard score in variable 1 by his standard score in variable 2, all of the products will be positive, provided that each individual falls on the same side of the mean on both variables. The Pearson correlation coefficient is simply the mean of these products. It will have a high positive value when corresponding standard scores are of equal sign and of approximately equal amount in the two variables. When subjects above the average in one variable are below the average in the other, the corresponding cross-products will be negative. If the sum of the cross-products is negative, the correlation will be negative. When some products are positive and some negative, the correlation will be close to zero.

In actual practice, it is not necessary to convert each raw score to a standard score before finding the cross-products, since this conversion can be made once for all after the cross-products have been added. There are many shortcuts for computing the Pearson correlation coefficient. The method demonstrated in Table 7 is not the quickest, but it illustrates the meaning of the correlation coefficient more clearly than other methods based on computational shortcuts. Table 7 shows the computation of a Pearson correlation between the arithmetic and reading scores of 10 children. Next to each child's name are his scores in the arithmetic test (X) and the reading test (Y). The sums and means of the 10 scores are given under the respective columns. The third column shows the deviation (x) of each arithmetic score from the arithmetic mean; and the fourth column, the deviation (y) of each reading score from the reading mean. These deviations are squared in the next two columns, and the sums of the squares are used in computing the standard deviations of the arithmetic and reading scores by the method described in Chapter 4. Rather than dividing each x and y by its corresponding σ to find standard scores, we perform this division only once at the end, as shown in the correlation formula in Table 7. The cross-products in the last column (xy) have been found by multiplying the corresponding deviations in the x and y columns. To compute the correlation (r), the sum of these cross-products is divided by the number of cases (N) and by the product of the two standard deviations (σx and σy).

TABLE 7
Computation of Pearson Product-Moment Correlation Coefficient

Pupil      Arithmetic (X)  Reading (Y)    x    y    x²   y²   xy
Bill             41            17        +1   -4     1   16   -4
Carol            38            28        -2   +7     4   49  -14
Geoffrey         48            22        +8   +1    64    1    8
Ann              32            16        -8   -5    64   25   40
Bob              34            18        -6   -3    36    9   18
Jane             36            15        -4   -6    16   36   24
Ellen            41            24        +1   +3     1    9    3
Ruth             43            20        +3   -1     9    1   -3
Dick             47            23        +7   +2    49    4   14
Mary             40            27         0   +6     0   36    0
Sum (Σ)         400           210         0    0   244  186   86
Mean (M)         40            21

$$ \sigma_x = \sqrt{\frac{244}{10}} = \sqrt{24.40} = 4.94 \qquad \sigma_y = \sqrt{\frac{186}{10}} = \sqrt{18.60} = 4.31 $$

$$ r_{xy} = \frac{\Sigma xy}{N\,\sigma_x\,\sigma_y} = \frac{86}{(10)(4.94)(4.31)} = \frac{86}{212.91} = .40 $$
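The arithmetic of Table 7 can also be reproduced mechanically. The following is a minimal sketch in Python; the listing and its variable names are illustrative and are not part of the original table.

```python
# Minimal sketch reproducing the Table 7 computation; names are illustrative.
from math import sqrt

arithmetic = [41, 38, 48, 32, 34, 36, 41, 43, 47, 40]  # X scores
reading    = [17, 28, 22, 16, 18, 15, 24, 20, 23, 27]  # Y scores

n = len(arithmetic)
mean_x = sum(arithmetic) / n   # 40
mean_y = sum(reading) / n      # 21

# Deviations from the means (the x and y columns of Table 7)
dev_x = [score - mean_x for score in arithmetic]
dev_y = [score - mean_y for score in reading]

# Standard deviations from the sums of squared deviations
sd_x = sqrt(sum(d * d for d in dev_x) / n)  # sqrt(244/10) = 4.94
sd_y = sqrt(sum(d * d for d in dev_y) / n)  # sqrt(186/10) = 4.31

# Sum of cross-products, divided once by N and by both standard deviations
sum_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))  # 86
r = sum_xy / (n * sd_x * sd_y)
print(f"{r:.2f}")  # 0.40
```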

STATISTICAL SIGNIFICANCE. The correlation of .40 found in Table 7 indicates a moderate degree of positive relationship between the arithmetic and reading scores. There is some tendency for those children doing well in arithmetic also to perform well on the reading test, and vice versa, although the relation is not close. If we are concerned only with the performance of these 10 children, we can accept this correlation as an adequate description of the degree of relation existing between the two variables in this group. In psychological research, however, we are usually interested in generalizing beyond the particular sample of individuals tested to the larger population which they represent. For example, we might want to know whether arithmetic and reading ability are correlated among American schoolchildren of the same age as those we tested. Obviously, the 10 cases actually examined would constitute a very inadequate sample of such a population. Another comparable sample of the same size might yield a much lower or a much higher correlation.

There are statistical procedures for estimating the probable fluctuation to be expected from sample to sample in the size of correlations, means, standard deviations, and any other group measures. The question usually asked about correlations, however, is simply whether the correlation is significantly greater than zero. In other words, if the correlation in the population is zero, could a correlation as high as that obtained in our sample have resulted from sampling error alone? When we say that a correlation is "significant at the 1 percent (.01) level," we mean that the chances are no greater than one out of 100 that the population correlation is zero. Hence, we conclude that the two variables are truly correlated. Significance levels refer to the risk of error we are willing to take in drawing conclusions from our data. If a correlation is said to be significant at the .05 level, the probability of error is 5 out of 100. Most psychological research applies the .01 or the .05 levels, although other significance levels may be employed for special reasons.

The correlation of .40 found in Table 7 fails to reach significance even at the .05 level. As might have been anticipated, with only 10 cases it is difficult to establish a general relationship conclusively; with this size of sample, the smallest correlation significant at the .05 level is .63. Any correlation below that value simply leaves unanswered the question of whether the two variables are correlated in the population from which the sample was drawn.

The minimum correlations significant at the .01 and .05 levels for groups of different sizes can be found by consulting tables of the significance of correlations in any statistics textbook. For interpretive purposes in this book, however, only an understanding of the general concept is required. Parenthetically, it might be added that significance levels can be interpreted in a similar way when applied to other statistical measures. For example, to say that the difference between two means is significant at the .01 level indicates that we can conclude, with only one chance out of 100 of being wrong, that a difference in the obtained direction would be found if we tested the whole population from which our samples were drawn. For instance, if in the sample tested the boys had obtained a significantly higher mean than the girls on a mechanical comprehension test, we could conclude that the boys would also excel in the total population.

THE RELIABILITY COEFFICIENT. Correlation coefficients have many uses in the analysis of psychological data. The measurement of test reliability represents one application of such coefficients. An example of a reliability coefficient, computed by the Pearson Product-Moment method, is to be found in Figure 10. In this case, the scores of 104 persons on two equivalent forms of a Word Fluency test¹ were correlated. In one form, the subjects were given five minutes to write as many words as they could that began with a given letter. The second form was identical, except that a different letter was employed. The two letters were chosen by the test authors as being approximately equal in difficulty for this purpose.

The correlation between the number of words written in the two forms of this test was found to be .72. This correlation is high and significant at the .01 level: with 104 cases, any correlation of .25 or higher is significant at this level. Nevertheless, the obtained correlation is somewhat lower than is desirable for reliability coefficients, which usually fall in the .80's or .90's. An examination of the scatter diagram in Figure 10 shows a typical bivariate distribution of scores corresponding to a high positive correlation. It will be noted that the tallies cluster close to the diagonal extending from the lower left- to the upper right-hand corner; the trend is definitely in this direction, although there is a certain amount of scatter of individual entries. In the following section, the use of the correlation coefficient in computing different measures of test reliability will be considered.

1 One of the subtests of the SRA Tests of Primary Mental Abilities for Ages 11 to 17. The data were obtained in an investigation by Anastasi and Drake (1954).
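The minimum significant correlations cited above (.63 for 10 cases at the .05 level; .25 for 104 cases at the .01 level) can also be derived from the t distribution rather than looked up in a table. The following minimal sketch assumes the scipy library is available; the function name is illustrative.

```python
# Sketch: smallest correlation significantly different from zero (two-tailed),
# derived from the t distribution; scipy is assumed to be available.
from math import sqrt
from scipy import stats

def min_significant_r(n_cases: int, alpha: float) -> float:
    df = n_cases - 2                        # degrees of freedom for r
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / sqrt(t_crit ** 2 + df)

print(f"{min_significant_r(10, 0.05):.2f}")   # 0.63, as cited for 10 cases
print(f"{min_significant_r(104, 0.01):.2f}")  # 0.25, as cited for 104 cases
```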
[FIG. 10. A Reliability Coefficient of .72. (Data from Anastasi & Drake, 1954.)]

TYPES OF RELIABILITY

TEST-RETEST RELIABILITY. The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion. The reliability coefficient (r₁₁) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test. The error variance corresponds to the random fluctuations of performance from one test session to the other. These variations may result in part from uncontrolled testing conditions, such as extreme changes in weather, sudden noises and other distractions, or a broken pencil point. To some extent, however, they arise from changes in the condition of the subject himself, as illustrated by illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant nature, and the like. Retest reliability shows the extent to which scores on a test can be generalized over different occasions; the higher the reliability, the less susceptible the scores are to the random daily changes in the condition of the subject or of the testing environment.

When retest reliability is reported in a test manual, the interval over which it was measured should always be specified. Since retest correlations decrease progressively as this interval lengthens, there is not one but an infinite number of retest reliability coefficients for any test. It is also desirable to give some indication of relevant intervening experiences of the subjects on whom reliability was measured, such as educational or job experiences, counseling, psychotherapy, and so forth.

Apart from the desirability of reporting length of interval, what considerations should guide the choice of interval? Illustrations could readily be cited of tests showing high reliability over periods of a few days or weeks, but whose scores reveal an almost complete lack of correspondence when the interval is extended to as long as ten or fifteen years. Many preschool intelligence tests, for example, yield moderately stable measures within the preschool period, but are virtually useless as predictors of late childhood or adult IQ's. In actual practice, however, a simple distinction can usually be made. Short-range, random fluctuations that occur during intervals ranging from a few hours to a few months are generally included under the error variance of the test score. Thus, in checking this type of test reliability, an effort is made to keep the interval short. In testing young children, the period should be even shorter than for older persons, since at early ages progressive developmental changes are discernible over a period of a month or even less. For any type of person, the interval between retests should rarely exceed six months.

Any additional changes in the relative test performance of individuals that occur over longer periods of time are apt to be cumulative and progressive rather than entirely random. Moreover, they are likely to characterize a broader area of behavior than that covered by the test performance itself. Thus, one's general level of scholastic aptitude, mechanical comprehension, or artistic judgment may have altered appreciably over a ten-year period, owing to unusual intervening experiences. The individual's status may have either risen or dropped appreciably in relation to others of his own age, because of circumstances peculiar to his own home, school, or community environment, or for other reasons such as illness or emotional disturbance.

The extent to which such factors can affect an individual's psychological development provides an important problem for investigation. This question, however, should not be confused with that of the reliability of a particular test. When we measure the reliability of the Stanford-Binet, for example, we do not ordinarily correlate retest scores over a period of ten years, or even one year, but over a few weeks. To be sure, long-range retests have been conducted with such tests, but the results are generally discussed in terms of the predictability of adult intelligence
from childhood performance, rather than in terms of the reliability of a particular test. The concept of reliability is generally restricted to short-range, random changes that characterize the test performance itself rather than the entire behavior domain that is being tested.

It should be noted that different behavior functions may themselves vary in the extent of daily fluctuation they exhibit. For example, steadiness of delicate finger movements is undoubtedly more susceptible to slight changes in the person's condition than is verbal comprehension. If we wish to obtain an over-all estimate of the individual's habitual finger steadiness, we would probably require repeated tests on several days, whereas a single test session would suffice for verbal comprehension. Again we must fall back on an analysis of the purposes of the test and on a thorough understanding of the behavior the test is designed to predict.

Although apparently simple and straightforward, the test-retest technique presents difficulties when applied to most psychological tests. Practice will probably produce varying amounts of improvement in the retest scores of different individuals. Moreover, if the interval between retests is fairly short, the examinees may recall many of their former responses. In other words, the same pattern of right and wrong responses is likely to recur through sheer memory. Thus, the scores on the two administrations of the test are not independently obtained, and the correlation between them will be spuriously high. The nature of the test itself may also change with repetition. This is especially true of problems involving reasoning or ingenuity. Once the subject has grasped the principle involved in the problem, or once he has worked out a solution, he can reproduce the correct response in the future without going through the intervening steps. Only tests that are not appreciably affected by repetition lend themselves to the retest technique. A number of sensory discrimination and motor tests would fall into this category. For the large majority of psychological tests, however, the retest technique is inappropriate.

ALTERNATE-FORM RELIABILITY. One way of avoiding the difficulties encountered in test-retest reliability is through the use of alternate forms of the test. The same persons can thus be tested with one form on the first occasion and with another, comparable form on the second. The correlation between the scores obtained on the two forms represents the reliability coefficient of the test. It will be noted that such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms). This coefficient thus combines two types of reliability. Since both types are important for most testing purposes, however, alternate-form reliability provides a useful measure for evaluating many tests.

The concept of item sampling, or content sampling, underlies not only alternate-form reliability but also other types of reliability to be discussed shortly. It is therefore appropriate to examine it more closely. Everyone has probably had the experience of taking a course examination in which he felt he had a "lucky break" because many of the items covered the very topics he happened to have studied most carefully. On another occasion, he may have had the opposite experience, finding an unusually large number of items on areas he had failed to review. This familiar situation illustrates error variance resulting from content sampling. To what extent do scores on such a test depend on factors specific to the particular selection of items? If a different investigator, working independently, were to prepare another test in accordance with the same specifications, how much would an individual's score differ on the two tests?

Let us suppose that a 40-item vocabulary test has been constructed as a measure of general verbal comprehension. Now suppose that a second list of 40 different words is assembled for the same purpose, and that the items are constructed with equal care to cover the same range of difficulty as the first test. The differences in the scores obtained by the same individuals on these two tests illustrate the type of error variance under consideration. Owing to fortuitous factors in the past experience of different individuals, the relative difficulty of the two lists will vary somewhat from person to person. Thus, the first list might contain a larger number of words unfamiliar to individual A than does the second list. The second list, on the other hand, might contain a disproportionately large number of words unfamiliar to individual B. If the two individuals are approximately equal in their overall word knowledge (i.e., in their "true scores"), B will nevertheless excel A on the first list, while A will excel B on the second. The relative standing of these two persons will therefore be reversed on the two lists, owing to chance differences in the selection of items.

Like test-retest reliability, alternate-form reliability should always be accompanied by a statement of the length of the interval between test administrations, as well as a description of relevant intervening experiences. If the two forms are administered in immediate succession, the resulting correlation shows reliability across forms only, not across occasions. The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.

In the development of alternate forms, care should of course be exercised to ensure that they are truly parallel. Fundamentally, parallel forms of a test should be independently constructed tests designed to meet the same specifications. The tests should contain the same number of items,
and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal. Instructions, time limits, illustrative examples, format, and all other aspects of the test must likewise be checked for comparability.

It should be added that the availability of parallel test forms is desirable for other reasons besides the determination of test reliability. Alternate forms are useful in follow-up studies or in investigations of the effects of some intervening experimental factor on test performance. The use of several alternate forms also provides a means of reducing the possibility of coaching or cheating.

Although much more widely applicable than test-retest reliability, alternate-form reliability also has certain limitations. In the first place, if the behavior functions under consideration are subject to a large practice effect, the use of alternate forms will reduce but not eliminate such an effect. To be sure, if all examinees were to show the same improvement with repetition, the correlation between their scores would remain unaffected, since adding a constant amount to each score does not alter the correlation coefficient. It is much more likely, however, that individuals will differ in amount of improvement, owing to extent of previous practice with similar material, motivation in taking the test, and other factors. Under these conditions, the practice effect represents another source of variance that will tend to reduce the correlation between the two test forms. If the practice effect is small, the reduction will be negligible.

Another related question concerns the degree to which the nature of the test will change with repetition. In certain types of ingenuity problems, for example, any item involving the same principle can be readily solved by most subjects once they have worked out the solution to the first. In such a case, changing the specific content of the items in the second form would not suffice to eliminate this carry-over from the first form. Finally, it should be added that alternate forms are unavailable for many tests, because of the practical difficulties of constructing comparable forms. For all these reasons, other techniques for estimating test reliability are often required.

SPLIT-HALF RELIABILITY. From a single administration of one form of a test it is possible to arrive at a measure of reliability by various split-half procedures. In such a way, two scores are obtained for each person by dividing the test into comparable halves. It is apparent that split-half reliability provides a measure of consistency with regard to content sampling. Temporal stability of the scores does not enter into such reliability, because only one test session is involved. This type of reliability coefficient is sometimes called a coefficient of internal consistency, since only a single administration of a single form is required.

To find split-half reliability, the first problem is how to split the test in order to obtain the most nearly comparable halves. Any test can be divided in many different ways. In most tests, the first half and the second half would not be comparable, owing to differences in the nature and difficulty level of items, as well as to the cumulative effects of warming up, practice, fatigue, boredom, and any other factors varying progressively from the beginning to the end of the test. A procedure that is adequate for most purposes is to find the scores on the odd and even items of the test. If the items were originally arranged in an approximate order of difficulty, such a division yields very nearly equivalent half-scores. One precaution to be observed in making such an odd-even split pertains to groups of items dealing with a single problem, such as questions referring to a particular mechanical diagram or to a given passage in a reading test. In this case, a whole group of items should be assigned intact to one or the other half. Were the items in such a group to be placed in different halves of the test, the similarity of the half-scores would be spuriously inflated, since any single error in understanding of the problem might affect items in both halves.

Once the two half-scores have been obtained for each person, they may be correlated by the usual method. It should be noted, however, that this correlation actually gives the reliability of only a half test. For example, if the entire test consists of 100 items, the correlation is computed between two sets of scores each of which is based on only 50 items. In both test-retest and alternate-form reliability, on the other hand, each score is based on the full number of items in the test.

Other things being equal, the longer a test, the more reliable it will be.² It is reasonable to expect that, with a larger sample of behavior, we can arrive at a more adequate and consistent measure. The effect that lengthening or shortening a test will have on its coefficient can be estimated by means of the Spearman-Brown formula, given below:

$$ r_{11} = \frac{n\,r'_{11}}{1 + (n - 1)\,r'_{11}} $$

in which r₁₁ is the estimated coefficient, r'₁₁ the obtained coefficient, and n the number of times the test is lengthened or shortened. Thus, if the number of test items is increased from 25 to 100, n is 4; if it is decreased from 60 to 30, n is ½. The Spearman-Brown formula is widely used in determining reliability by the split-half method, many test manuals reporting reliability in this form.

2 Lengthening a test, however, will increase only its consistency in terms of content sampling, not its stability over time (see Cureton, 1965).
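A minimal sketch of the Spearman-Brown estimate in Python follows; the function name and the sample values are illustrative only, not taken from the text.

```python
# Minimal sketch of the Spearman-Brown estimate; names are illustrative.
def spearman_brown(r_obtained: float, n: float) -> float:
    """Estimated reliability when a test is lengthened n times (n < 1 shortens)."""
    return (n * r_obtained) / (1 + (n - 1) * r_obtained)

# Doubling a half-test: the split-half case (n = 2).
print(f"{spearman_brown(0.80, 2):.3f}")    # 0.889
# Cutting a 60-item test to 30 items (n = 1/2).
print(f"{spearman_brown(0.90, 0.5):.3f}")  # 0.818
```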
When applied to split-half reliability, the formula always involves doubling the length of the test. Under these conditions, it can be simplified as follows:

$$ r_{11} = \frac{2\,r'_{11}}{1 + r'_{11}} $$

An alternate method for finding split-half reliability was developed by Rulon (1939). It requires only the variance of the differences between each person's scores on the two half-tests (σ_d²) and the variance of total scores (σ_t²); these two values are substituted in the following formula, which yields the reliability of the whole test directly:

$$ r_{11} = 1 - \frac{\sigma_d^2}{\sigma_t^2} $$

It is interesting to note the relationship of this formula to the definition of error variance. Any difference between a person's scores on the two half-tests represents chance error. The variance of these differences, divided by the variance of total scores, gives the proportion of error variance in the scores. When this error variance is subtracted from 1.00, it gives the proportion of "true" variance, which is equal to the reliability coefficient.

KUDER-RICHARDSON RELIABILITY. A fourth method for finding reliability, also utilizing a single administration of a single form, is based on the consistency of responses to all items in the test. This interitem consistency is influenced by two sources of error variance: (1) content sampling (as in alternate-form and split-half reliability); and (2) heterogeneity of the behavior domain sampled. The more homogeneous the domain, the higher the interitem consistency. For example, if one test includes only multiplication items, while another comprises addition, subtraction, multiplication, and division items, the former test will probably show more interitem consistency than the latter. In the latter, more heterogeneous test, one subject may perform better in subtraction than in any of the other arithmetic operations; another subject may score relatively well on the division items, but more poorly in addition, subtraction, and multiplication; and so on. A more extreme example would be represented by a test consisting of 40 vocabulary items, in contrast to one containing 10 vocabulary, 10 spatial relations, 10 arithmetic reasoning, and 10 perceptual speed items. In the latter test, there might be little or no relationship between an individual's performance on the different types of items.

It is apparent that test scores will be less ambiguous when derived from relatively homogeneous tests. Suppose that in the highly heterogeneous, 40-item test cited above, Smith and Jones both obtain a score of 20. Can we conclude that the performances of the two on this test were equal? Not at all. Smith may have correctly completed 10 vocabulary items, 10 perceptual speed items, and none of the arithmetic reasoning and spatial relations items. In contrast, Jones may have received a score of 20 by the successful completion of 5 perceptual speed, 5 spatial relations, 10 arithmetic reasoning, and no vocabulary items.

Many other combinations could obviously produce the same total score of 20. This score would have a very different meaning when obtained through such dissimilar combinations of items. In the relatively homogeneous vocabulary test, on the other hand, a score of 20 would probably mean that the subject had succeeded with approximately the first 20 words, if the items were arranged in ascending order of difficulty. He might have failed two or three easier words and correctly responded to two or three more difficult items beyond the 20th, but such individual variations are slight in comparison with those found in a more heterogeneous test.

A highly relevant question in this connection is whether the criterion that the test is trying to predict is itself relatively homogeneous or heterogeneous. Although homogeneous tests are to be preferred because their scores permit fairly unambiguous interpretation, a single homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion. Moreover, in the prediction of a heterogeneous criterion, the heterogeneity of test items would not necessarily represent error variance. Traditional intelligence tests provide a good example of heterogeneous tests designed to predict heterogeneous criteria. In such a case, however, it may be desirable to construct several relatively homogeneous tests, each measuring a different phase of the heterogeneous criterion. Thus, unambiguous interpretation of test scores could be combined with adequate criterion coverage.

The most common procedure for finding interitem consistency is that developed by Kuder and Richardson (1937). As in the split-half methods, interitem consistency is found from a single administration of a single test. Rather than requiring two half-scores, however, such a technique is based on an examination of performance on each item. Of the various formulas derived in the original article, the most widely applicable, commonly known as "Kuder-Richardson formula 20," is the following:³

$$ r_{11} = \frac{n}{n - 1}\left(\frac{\sigma_t^2 - \Sigma pq}{\sigma_t^2}\right) $$

In this formula, r₁₁ is the reliability coefficient of the whole test, n is the number of items in the test, and σ_t the standard deviation of total scores on the test. The only new term in this formula, Σpq, is found by tabulating the proportion of persons who pass (p) and the proportion who do not pass (q) each item. The product of p and q is computed for each item, and these products are then added across all items to give Σpq. Since in the process of test construction p is often routinely recorded in order to find the difficulty level of each item, this method of determining reliability involves little additional computation.

3 A simple derivation of this formula can be found in Ebel (1965, pp. 325-327).
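Both single-administration formulas, the Rulon formula above and formula 20, translate directly into code. The following minimal Python sketch uses an invented 0/1 item-score matrix (rows are persons, columns are items); all names and data are illustrative, not from the original text.

```python
# Sketch of the Rulon and Kuder-Richardson formula 20 computations.
# The data below are invented solely to illustrate the arithmetic.

def variance(values: list[float]) -> float:
    """Population variance, as used in the formulas of this chapter."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def rulon(half_1: list[int], half_2: list[int]) -> float:
    """Whole-test reliability from two half-test scores per person."""
    diffs = [a - b for a, b in zip(half_1, half_2)]
    totals = [a + b for a, b in zip(half_1, half_2)]
    return 1 - variance(diffs) / variance(totals)

def kr20(item_scores: list[list[int]]) -> float:
    """Kuder-Richardson formula 20 for items scored 0 (fail) or 1 (pass)."""
    n_items = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    # Sum of pq across items: p = proportion passing, q = 1 - p
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / len(item_scores)
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (variance(totals) - sum_pq) / variance(totals)

scores = [  # five persons, four items (made-up data)
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(f"{kr20(scores):.2f}")  # 0.80 for this made-up matrix

# An odd-even split of the same matrix, scored by the Rulon formula:
odd  = [row[0] + row[2] for row in scores]
even = [row[1] + row[3] for row in scores]
print(f"{rulon(odd, even):.2f}")  # 0.88 for this made-up matrix
```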
It can be shown mathematically that the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients resulting from different splittings of a test (Cronbach, 1951).⁴ The ordinary split-half coefficient, on the other hand, is based on a planned split designed to yield equivalent sets of items. Hence, unless the test items are highly homogeneous, the Kuder-Richardson coefficient will be lower than the split-half reliability. An extreme example will serve to highlight the difference. Suppose we construct a 50-item test out of 25 different kinds of items, such that items 1 and 2 are vocabulary items, items 3 and 4 arithmetic reasoning, items 5 and 6 spatial orientation, and so on. The odd and even scores on this test could theoretically agree quite closely, thus yielding a high split-half reliability coefficient. The homogeneity of this test, however, would be very low, since there would be little consistency of performance among the entire set of 50 items. In this example, we would expect the Kuder-Richardson reliability to be much lower than the split-half reliability. It can be seen that the difference between Kuder-Richardson and split-half reliability coefficients may serve as a rough index of the heterogeneity of a test.

4 This is strictly true only when the split-half coefficients are found by the Rulon formula, not when they are found by correlation of halves and the Spearman-Brown formula (Novick & Lewis, 1967).

The Kuder-Richardson formula is applicable to tests whose items are scored as right or wrong, or according to some other all-or-none system. Some tests, however, may have multiple-scored items. On a personality inventory, for example, the respondent may receive a different numerical score on an item, depending on whether he checks "usually," "sometimes," "rarely," or "never." For such tests, a generalized formula has been derived, known as coefficient alpha (Cronbach, 1951; Novick & Lewis, 1967). In this formula, the value Σpq is replaced by Σσᵢ², the sum of the variances of item scores. The procedure is to find the variance of all individuals' scores for each item and then to add these variances across all items. The complete formula for coefficient alpha is given below:

$$ r_{11} = \frac{n}{n - 1}\left(\frac{\sigma_t^2 - \Sigma\sigma_i^2}{\sigma_t^2}\right) $$

A clear description of the computational layout for finding coefficient alpha can be found in Ebel (1965, pp. 326-330).
SCORER RELIABILITY. It should now be apparent that the different types of reliability vary in the factors they subsume under error variance. In one case, error variance covers temporal fluctuations; in another, it refers to differences between sets of parallel items; and in still another, it includes any interitem inconsistency. On the other hand, the factors excluded from measures of error variance are broadly of two types: (a) those factors whose variance should remain in the scores, since they are part of the true differences under consideration; and (b) those irrelevant factors that can be experimentally controlled. For example, it is not customary to report the error of measurement resulting when a test is administered under distracting conditions or with a longer or shorter time limit than that specified in the manual. Timing errors and serious distractions can be empirically eliminated from the testing situation. Hence, it is not necessary to report special reliability coefficients corresponding to "distraction variance" or "timing variance."

Similarly, most tests provide such highly standardized procedures for administration and scoring that error variance attributable to these factors is negligible. This is particularly true of group tests designed for mass testing and computer scoring. With such instruments, we need only make certain that the prescribed procedures are carefully followed and adequately checked. With clinical instruments employed in intensive individual examinations, on the other hand, there is evidence of considerable "examiner variance." Through special experimental designs, it is possible to separate this variance from that attributable to temporal fluctuations in the subject's condition or to the use of alternate test forms.

One source of error variance that can be checked quite simply is scorer variance. Certain types of tests, notably tests of creativity and projective tests of personality, leave a good deal to the judgment of the scorer. With such tests, there is as much need for a measure of scorer reliability as there is for the more usual reliability coefficients. Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two scores thus obtained by each examinee are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability. This type of reliability is commonly computed when subjectively scored instruments are employed in research. Test manuals should also report it when appropriate.

OVERVIEW. The different types of reliability coefficients discussed in this section are summarized in Tables 8 and 9. In Table 8, the operations followed in obtaining each type of reliability are classified with regard to the number of test forms and the number of testing sessions required. Table 9 shows the sources of variance treated as error variance by each procedure.

TABLE 8
Techniques for Measuring Reliability, in Relation to Test Form and Testing Session

Testing Sessions    Test Forms Required
Required            One                      Two
One                 Split-Half,              Alternate-Form
                    Kuder-Richardson,        (Immediate)
                    Scorer
Two                 Test-Retest              Alternate-Form
                                             (Delayed)

TABLE 9
Sources of Error Variance in Relation to Reliability Coefficients

Type of Reliability Coefficient           Error Variance
Test-Retest                               Time sampling
Alternate-Form (Immediate)                Content sampling
Alternate-Form (Delayed)                  Time sampling and Content sampling
Split-Half                                Content sampling
Kuder-Richardson and Coefficient Alpha    Content sampling and Content heterogeneity
Scorer                                    Interscorer differences

Any reliability coefficient may be interpreted directly in terms of the percentage of score variance attributable to different sources. Thus, a reliability coefficient of .85 signifies that 85 percent of the variance in test scores depends on true variance in the trait measured and 15 percent depends on error variance (as operationally defined by the specific procedure followed). The statistically sophisticated reader may recall that it is the square of a correlation coefficient that represents proportion of common variance. Actually, the proportion of true variance in test scores is the square of the correlation between scores on a single form of the test and true scores free from chance errors. This correlation, known as the index of reliability,⁵ is equal to the square root of the reliability coefficient (√r₁₁). When the index of reliability is squared, the result is the reliability coefficient (r₁₁), which can therefore be interpreted directly as the percentage of true variance.

5 Derivations of the index of reliability, based on two different sets of assumptions, are given by Gulliksen (1950b, Chs. 2 and 3).

Experimental designs that yield more than one type of reliability coefficient for the same group permit the analysis of total score variance into different components. Let us consider the following hypothetical example. Forms A and B of a creativity test have been administered with a two-month interval to 100 sixth-grade children. The resulting alternate-form reliability is .70. From the responses to either form, a split-half reliability coefficient can also be computed.⁶ This coefficient, stepped up by the Spearman-Brown formula, is .80. Finally, a second scorer has rescored a random sample of 50 papers, from which a scorer reliability of .92 is obtained. The three reliability coefficients can now be analyzed to yield the error variances shown in Table 10 and Figure 11. It will be noted that by subtracting the error variance attributable to content sampling alone (split-half reliability) from the error variance attributable to both content and time sampling (alternate-form reliability), we find that .10 of the variance can be attributed to time sampling alone. Adding the error variances attributable to content sampling (.20), time sampling (.10), and interscorer difference (.08) gives a total error variance of .38 and hence a true variance of .62. These proportions, expressed in the more familiar percentage terms, are shown graphically in Figure 11.

6 For a better estimate of the coefficient of internal consistency, split-half correlations could be computed for each form and the two coefficients averaged by the appropriate statistical procedures.

TABLE 10
Analysis of Sources of Error Variance in a Hypothetical Test

From delayed alternate-form reliability:      1 - .70 = .30  (time sampling plus content sampling)
From split-half, Spearman-Brown reliability:  1 - .80 = .20  (content sampling)
Difference:                                             .10  (time sampling)
From scorer reliability:                      1 - .92 = .08  (interscorer difference)

Total Measured Error Variance = .20 + .10 + .08 = .38
True Variance = 1 - .38 = .62
Error Variance: 38'J.
that individual differences in test scores depend on speed of perform-
A_- --x.--8-'X,-"'" ance, reliability coefficients found by these methods will be spuriously
10 high. An extreme example will help to clarify this point. Let us suppose
that a 50-item test depends entirely on speed, so that individual differ-
Stable over lime; consistent over !orms; ences in score are based wholly on number of items attempted, rather
free !rom interscorer difference than on errors. Then, if individual A obtains a score of 44, he will obvi-
ously have 22 correct odd items and 22 correct even items. Similarly,
individual B, with a score of 34, will have odd and even scores of 17 and
17, respectively. Consequently, except for accidental careless errors on a
11. Percentage Distribution of Score Variance in a Hypothetical Test. few items, the correlation between odd and even scores would be perfect,
or + 1.00. Such a correlation, however, is entirely spurious and provides
no information about the reliability of the test.
An examination of the procedures followed in finding both split-half
'LIABILITY OF SPEEDED TESTS
and Kuder-Richardson reliability \:vill show that both are based on the
"
oth in test construction and in the interpretation of test scores, an consistency in number of errors made by the examinee. If, now, indi-
vidual differences in test scores depend, l~ot on errors, but on speed, the
portant distinction is that between t~e ~ea.s~rement. of speed and of
wer. A pure speed test is one in whIch md1~dual differences depend measure of reliability must obviously be based on consistency in speed
tirel\, on speed of performance. Such a test IS co~s~ructed fr~~ Items of u:ork. 'Vhen test performance depends on a combination of speed and
power, the single-trial reliability coefficient will fall below 1.00, but it
uniformly low difficulty, all of which are well wI~hm ~he. a?lhty level
the persons for whom the test is designed. The hme 1Im1t.1~made so will still be spuriously high. As long as individual differences in test
ort that no one can finish all the items. Under these conditIons, each scores are appreciably affected by speed, single-trial reliability coefficients
erson's score rcflects only the speed with which he worked. A pur~ cannot be properly interpreted.
DICeI' test, on the other hand, has a time limit long el:ough ~o permIt
'What alternative procedures are available to determine the reliability
veryone to attempt an items. The difficulty of the Items IS steeply of Significantly spl1eded tests? If the test-retest techniqu~ is applicable, it
, raded, and the test includes some items too difficult for anyone to solve, would be appropriate. Similarly, equivalent-form reliability may be
properly employed with speed tests. Split-half techniques may also be
sothat no one can get a perfect score. "
It will be noted that both speed and power tests are deSIgned to p~e-" used, provided that the split is made in terms of time rather than in
vent the achievement of perfect scores. The reason for such.a precauh~, terms of items. In other words, the half-scores must be based on sep-
is that perfect scores are indeterminate, since it is impos~lble to .knm.Y arately timed parts of the test. One way of effecting such a split is to
how much higher the individual's score would have been If m?re.l~ems, administer two eqUivalent halves of the test with separate time limits.
'ffi It items had been included, To enable each mdlVldual For example, the odd and even items may be separately printed on differ-
or more d I cu, ,,' .d d
to show fully what he is able to a~c,qm1?H,~rthe test must proVI e a e- ent pages, and each set of items given with one-half the time limit of the
. qllate ceiling, either in number o~ ~te"':iJr in. difficulty level. An..ex~ep~ entire test. Such a procedure is tantamount to administering two equiva-
lion to this rule is ,found in mastery ,Jng, as Illustrated by the cllt~no~ lent forms of the test in immediate succession. Each form, however, is
referenced tests discussed in ChaPtrc4. The purpose of such testm~ IS h¥f as long as the test proper, while the subjects' scores are normally
not to establish the limits of what th'e3hdividual can do, but to determme based on the whole test. For this reason, either the Spearman-Brown or
whether a preestablished performance level has or has not been rea.ehed. some other appropriate formula should be used to find the reliability of
In actual practice, the distinction between speed and power :ests IS ~nc the whole test.
of degree most tests depending on both powe~ and speed 111 varymg If it is not feasible to administer the two half-tests separarely, an al-
proportiO~S. Information about these proportions is needed for each test ternative procedure is to divide the total t,ime into quarters, and to find
. rder not onlv to understand what the test measures but also to a score for each of the four quarters. This caneasil~':J;>~ 'done by having
~o~se the prop~r procedures for evaluating its reliability. Single-trial the examinees mark the item on which they ar~ w6rkiti~ whenever the
reliability coefficients, such as t~ose found by odd-even or Ku.der- examiner gives a prearranged signal. The number of items correctly
Richardson techniques, are inapplicable to speeded tests. To the extent completed within the first and fourth quarters can then be combined to
Principles of PsycllOlogical Testing

'~w,' represent one half-score, while those in the second and thir~ q~artcrs TABLE 11
," can be combined to yield the other half-score. Such a combmahon of Reliability Coefficients of Four of the SRA Tesls of Primary MenIal
. quarters tends to balance out the cumulative effects of practice, fatigue, Abilities for Ages 11 to 17 (1st Edition)
and other factors. This method is especially satisfactory when the items (Data from Anastasi & Drake, 1954)
are not steeply graded in difficulty level.
When is a test appreciably speeded? Under what conditions must the Reliability Coefficient Verbal
. special precautions discussed in this section be observed? Obviously, the Found by: Meaning Reasoning Space Number
mere employment of a time limit does not signify a speed test. If all
subjects finish within the giycn time limit, speed of work plays no part Single-trial odd-even method .94 ,96 .90 .92
in determining the scores. Percentage of persons who fail to complete Separately timed halves .90 .87 .75 .83
the test might be taken as a crude index of speed versus power. Even
when no one finishes the test, however, the role of speed may be negli-
gible. For example, if everyone (<()mpletes exactly 40 items of a 50-item p~ted, the reliability of the Space test is .75, in contrast to a spuriously
.test, individual differences with regard to speed are entirely absent, al- hIgh odd-even coefficient of .90. Similarly, the reliability of the Reasoning
though no one had time to attempt all the items. te,st drops f~on~..96 to .87, and that of the Kumber test drops from .92 to
The essential question, of course, is: "To what extent are individual .8,3. The rehablhty of the relatively unspeeded Verbal Meaning test, all
differences in test scores attributable to speed?" In more technical terms, the other hand, shows a negligible difference whe'n computed by the two
we want to know what proportion of the total variance of test scores is methods.
speed variance. This proportion can be estimated roughly by finding the
... variance of number of items completed by different persons and dividing
'\ it by the variance of total test scores (u·'/r:J't). In the example cited DEPENDENCE OF RELIABILITY COEFFICIENTS
above, in which ev~ry individual finishes 40 items, the numerator of this ON THE SAMPLE TESTED
fraction would be zero, since there are no individuaL differences in num-
ber of items completed (u'(' =
0). The entire index would thus equal zero HET~ROG~XEITY. An important factor influencing the size of a reliability
in a pure power test. On the other hand, if the total test variance (U2f) coeffiCient IS the nature of the group on which reliability is measured. In
is attributable to individual differences in speed, the two variances will ~he. ~rst pla~e, any correlation coefficient is affected by the range of
.. be equal and the ratio will be 1.00. Several more refined procedures have 1I1?~\')?ual dl~erenc:~ in the group. If every member of a group were
;". been developed for determining this proportion, but their detailed con- ah~~ 111spcllmg ablhty, then the correlation of spelling with any other
sideration falls beyond the scope of this book., . '. a~lll~y would be zero in that group. It would obviously be impossible;'
An example of the effect of speed on single-trial reliability coefficients WI~~1Ilsuch a group, to predict an individual's standing in any other
is provided by data collected in an investigi~on of the first edition of ablhty from a knowledge of his spelling SCOFe.
the SRA Tests of Primary Mental Abilitie.s.~.r Ages 11 to 17 (Anastasi & Anot~er, less extreme, example is provided by the correlation between
Drake, 1954). In this study, the reliab!lijY',uf each test was first deter- tw~ aptItude tests, such as a verbal comprehenSion and an arithmetic rea-
mined by the usual odd-even procedm:e.;{~;fie~~coefficients, given in the sonmg test. If these tests were administered to a highly homogeneous
first row of Table 11, are closely sinjil Jhose reported in the test sampll:', such as a group of 300 college sophomores, the correlation be-
manual. Reliability coefficients were the ..," ,nfited by correlating scores I tween the two would probably be close to zero().There is little relation-
on separately timed halves. These coef1i~~:are shown in the second S~i~, wi~hin such a .s~lected s~mple of college students, between any in-
row of Table 11. Calculation of speed indexes showed that the Verbal dn Idual s verbal abdlty and hiS numerical reasoning abilitv. On the other
Meaning test is primarily a power teSt;,l~i1e the Reasoning test is some- hand, wer~ the test~ to. be. give.n to a hetero~neous sample of 300 per-
what more dependent on speed. The Spa.~~,and Number tests proved to sons, rangmg f~om mstItut~ona1tzed mentally retar~ed persons to college
be highly speeded. It will be noted iri;1;~h'1' 11 that, when properly com- graduates, a hIgh correlatlon would undoubted:}£,::be obtained betweep
the two tests. The mentally retarded would o~ta1.~~hoore.r:scores than tile
7 See. e.g .• Cronbach & Warrington (1951 Y,Culliksen (1950a, 1950b), Cuttman college graduates on both tests, and similar no{ . hips would hold for
(1955), Helmstadter & Ortmeyer (1953).
other subgroups within this highly heterogeneo'us ',pIe.'>
Principles of Psychological Testing
Reliability 127
mination of the hypothetical scatter diagram given in Figure 12 differences within a more homogeneous sample than the standardization
urther illustrate the dependence of correlatioll coefficients on the group, the reliabi~ity ~oefficient should be redetermined on such a sample.
Hity, or extent of individual differences, within the group. This Formulas for estimating the reliability coefficient to be expected when
r diagram shows a high positive correlation in the entire, heteroge- the standard deviation of the group is increased or decreased are avail-
s group, since the entries are closely clustered about the diagonal able in elementary statistics textbooks. It is preferable, however, to re-
ding from lower left- to upper right-hand corners. If, now, we con- compute the reliability coefficient empirically on a group comparable to
only the subgroup falling within the small rectangle in the upper that on which the test is to be used. For tests designed to cover a wide
-hand portion of the diagram, it is apparent that the correlation be- range ~f age or abil.ity, the test manual should report separate reliability
the two variables is close to zero. Individuals falling within this coeffiCIents for relatively homogeneous subgroups within the standardiza-
, icted range in both variables represent a highly homogeneous group, tion sample.
did the college sophomores mentipned above.
'ke all correlation coefficients, reliability coefficients depend on the
'iability of ,the sample within which they are found. Thus, if the re- ABILITY LEVEL. Kot only does the reliability coefficient vary with the
ility coefficient reported in a test manual was determined in a group extent of individual differences in the sample, but it may also vary be-
'ing from fourth-grade children to high school students, it cannot be tween groups differing in average ability level. These differences, more-
med that the reliability would be equally high within, let us say, an over, cannot usually be predicted or estimated by any statistical formula,
hth-grade sample. \Vhen a test is to be used to discriminate individual b~t c~n ~e' discovere~ .only by empirical tryout of the test on groups
d.dfermg 111 age or abilIty levcl. Such differences in the reliability of a
i ; I i , , smgle test may arise from the faCt that a slightlv different combination of
I , I
, ,
I
I
i
i
! I i !
I
, !, I , ,, i IIi I
I
abilities is measured at different difficulty lev~ls of the test. Or it may
I , I , I ! i I I iI/ ,
,
i I result from the statistical properties of the scale itself, as in the Stanford-
, I , I ! i ~ , I, "1'/1 11\
-'~--
, I I , ! I , " Binet (Pinneau, 1961, Ch. 5). Thus, for different ages and for different
; I i i I ill I ; 1\11'/1,/1
I , , I I IQ levels, the reliability coefficient of the Stanford-Binet varies from .83
-h': i I i
/lill,l '1/'11 Ifi'll IIi i

,
i I i 1 i i I jll'/I', :'/'1111, i ,I to .98. In other tests, reliability may be relatively low for the younger
, "

1 ! , ! I
i
i
,
i
I
I
I
!
, !
I ! , , I ~ I III
1/1 1'1/;/1/
/11/1.//, ,
/II I!I:
/I:/! I!
I
and less able ¥roups, since their scores are unduly influenced by guessing.
i ! I I ,II, , , , Under such CIrcumstances, the particular test should not be employed at
i

I !
~,

I
I

I
I

I .11 11[111/1
, 1/1 /1/1:/1 I :/1'
III /I /1;11, , I
;
i
i i
,
these levels.
!
!
I
,
,
I I ! ;1 1/11/1' I: I '11,/1 1 i
,' , I
,
I It is apparen.t t~at every reliability coefficient should be accompanied

•.
i I
, i'i il'" //I//!// 11/ II II' I I
! ! J
, by a fuD descnptIon of the type of group on which it was detelmined.
I i I : i I I !1I,II,lI/llIll/ll/ I . ;", ; I i
Special attention should be given to the variability and the ability level
i i , 1 'I;;;l;i.;: 'i , i
I
I
i i I
I
1'1'
;111/11
: I ~" 11111I11
//,/1 1/1,11/, I t~1
, i
, i
I i of the sa~~le. The reported reliability coefficient is applicable only to
I , i i , I ~amplef, s~nll]~r to that on which it was computed. A desirable and grow-
I 111 111 111/:1/!1I
; 1
I
I I 1

, , I ,/1 I II i,l I! ~I I' I


lIlg practice In test construction is to fractionate the standardization
i
, , : I I
I
/ I fll I 11/11/' ". i! I
L
, i '~?f sample into m~re homogeneous subgroups, with regard to age, sex, grade
i

,:~it· t~i
IJII I II I I
I 11·/1 I ! I : !
,
: I leve~, occupation, and the like, and to report separate reliability co-
11/ 1/1/ /I' I [I I i I effic~ents for each s~bgroup. Under these conditions, the reliability co-
I
II:W
, 11 1/1/ i I I ; I
cHicIen¥ are more lIkely to be applicable to the samples ~~th which the
I I
, II I I I
,
I 1 I
test is to be used ill actual practice. ..
," III
/I
I
II
I ..
", ..;
.'fo;.,
I
I I
I I
jfI I i .....- ... i ! I

/I I I I I !
I 1
",'·,1.
I I
Score on Variable 1
INTERPRETATION OF INDIVIDUAL SCORES. The reliability of a test may be
.Frc. 12. The Effect of Restricted Range upon a Correlation Coefficient.
expressed in terms of the standard error of measllre~ent ((fmen.,), also
Reliability U9
Principles of PsycllOlogical Testing
sign a probability to this statement for any given obtained score, we call
, called tIle standard error of a score. This measure is particularly wen
say that the statement would be correct for 99 percent of all the cases.
suited to the interpretation of individual scores, For many testing pur-
poses, it is therefore more useful than the reliability coefficient., TI~e On the basis of this reasoning, Gulliksen (1950b, pp. li-20) proposed
that the standard error of measurement be used as illustrated abo've to
, standard error of measurement can be easily computed from the rehabll-
estimate the reasonable limits of the true score for persons ""it-h any given
: ity coefficient of the test, by the following formula:
obtained score. It is in terms of such "reasonable limits" that the en-or of
measurement is customarily interpreted in psychological testing and it
will be so interpreted in this book.
.in which al is the standard deviation of the test scores and '11 the reliabil- The standard error of measurement and the reliabilitv coefficient are
ity coefficient, hath computed on the same group. For example, if devia- obviously alternative ways of exprt'ssing test reliability. Unlike the relia-
tion IQ's on a particular intelligence test have a standard devia~iol1 of ~5 bility coefficient, the error of measuren)('nt is independent of the varia-
.and a reliability coefficient of .89, the a"" ••. of an IQ on thIS test IS; bility of the group on which it is computed. Expressed in terms of indi-
=
;.15\/1- .89 15Y.ll = =
15(.33) 5. -v vidual scores, it remains unchanged when found in a homogeneous or a
. To understand what the UI/H'.' tells us about a score, let us suppose that heterogeneous group. On the other hand, being reported in score units,
. ~"wehad a set of 100 IQ's obtai~ed with the above test by a single boy, the error of measurement will not be directly comparable from test to
t;tJim,Because of the types of chance errors discussed in this chapter, these test. The usual problems of comparability of units would thus arise when
:\ scores will vary, falling into a normal distribution around Jim's true errors of measurement are reported in terms of arithmetic problems,
':score.The mean of this distribution of 100 scores can be taken as the true words in a vocabulary test, and the like. Hence, if ,,"e want to compare
,scoreand the standard deviation of the distribution can be taken as the the reliability of differetlt tests, the reliability coefficient is the better
, "11Im, • Like an\, standard deviation, this standard error can be interpreted measure. To interpret individual scores, the standard error of measure-
in t~rms of the normal curve frequencies 'discussed in Chapter 4 (see ment is more appropriatc.
Figure 3). It will be recalled that between the mean and ±lu there are
~pproximatf'ly 68 percent of the cases in, a normal curve. Th~s" we can
.nclude .h-.;-the chances arc roughly 2:1 (or 68:32) that JUllS IQ on INTERPRETATI01IO OF SCORE DIFFERENCES. It is particularly important to
, is test :_..'" 'fluctuate between ± lUIII,n.'. or 5 points on either side of his consider test reliability and errors of measurement \\'hen evaluating the
Ie IQ. If his true IQ is no, we 'V<:mldexpect him to score between 105 differellces between two scores. Thinking in terms of the range within
ld U5 about two-thirds (68 percent)' of the time. which each score may fluctuate serves as a check against overempha-
If we want to be more certain of oiI~rprediction, we can choose higher sizing small diHerences between scores. Such caution is desirable both j
'\lddsthan :2: 1. Reference to Figurei,1t~~Chapter 4 shows that ±3u covers when comparing test scores of different persons and when comparing
00.7 percent of the cases. It can be::~_sg~,~t.ainedfrom normal curve fre- the scores of the same individual in diHerent abilities. Similarly, changes
uenc)' tables that a distance of 2.58?:7.?~.·~i!4erside of the mean includes in scores following instructiun or other experimental \'ariables need to be
'actly 99 percent of the cases. II,tti{ee;:the chances are 99:1 that Jim's interpreted in the light of errors of measurement.
will fall within 2.58u ras, or (2.58)(5)
lll = 13 points, on either side of A frequent question abollt test scores concerns the individuars relative
. true IQ. We can thus state at ttte 99 percent' confidence level (with standing in different areas. Is Jane more able along verbal than along
Iy one chance of error out of l00J,:,that Jim's IQ on any single admin- numerical lines? Does Tom have more aptitude for mechanical than for
ation of the test will lie between"97 an9 123 (110 -13 and no + 13). verbal activities? If Jane scored higher on the verbal than on the nu-
''Jimwere given 100 equivalent te~ts. ilis IQ would fall outside this band merical sub tests .on an aptitude battery and Tom scored higher on the
'Valuesonly once.. mechanical than on the verbal, how sure can we be that they would still
'In actual practice, of course, we do not have the true scores, but only do so1on a retest with another form of the battery? In other words, could
e scores obtained in a single test administration. Under these circum- thc score differences have resulted merely from the chance: se)ection of
~nces,we could try to follow ~t.above reasoning in the reverse direc- specific items in the particular verbal, numerical, and mechahical tests
. If an individual's obtal,p~l.score is unlikely to deviate by more employed?
2.58O''''r ••. from his true"~ore, we could argue that his true score Because of the growing interest in the interpretation of score p'rofi.les,
lie within 2.580'n1f.B, olflis obtained score. Although we cannot as- test publishers have been developing report forms that permit the evalua-
Reliability 131

the difference between the Verbal Reasoning and Numerical Ability


:RAWSCORE
PERCENTILE
I~~l'~~'ll
60
w;;, I~~
9S
l~::'-;-1 ;~ I;; I
80"
ft .~~~
95 30 80 90 'l9 85 i
scores probably reflects a genuine difference in ability level; that bctween
Mechanical Reasoning and Space Relations probably does not; the dif-
ference between Abstract Reasoning and Mechanical Reasoning is in
the doubtful range.
It is well to bear in mind that the standard error of the difference be-
, tween two scores is larger than the error of measurement of either of the
two scores. This follows from the fact that this difference is affected by
, <;: ",. - the chance er1"Orspresent in both scores. The standard error of the diffe;-
ence between two scores can be found from the standard errors of meas-
~~\ ,
~ urement of the two scores by the follOWing formula:
- ..
'"~60 .
- -
~
~~ 50
u
.. in which Udi//. is the standard error of the difference between the two
~ 40 - - scores, and Umca8.) and Urneas .• are the standard errors of measurement of
,.
30
".
..

the separate scores. By substituting SDyll - TII for Umeus,) and


0 : .. .. ..
SDyll - r2lI for Umeas .• , we may rewrite the formula directly in terms of
0 .. .. .. reliability coefficients, as follows~
,

1 In this substitution, the same SD was used for tests 1 and 2, since their
scores would have to be expressed in terms of the same scale before they
Flc. 13. Score Profile on the Differential Aptitude Tests, Illustrating Use of could be compared.
Percentile Bands.
\Ve may illustrate the above procedllfe with the Verbal and Perform-
(Fig. 2, Fifth Edition Manual, p. 73. Reproduced b)' permission. Copyright ® 1973,
1974 by The Psychological Corporation, New York, N.Y. All rights reseT\'ed.)
ance IQ's on the Wechsler Adult Intelligence Scale (WAIS). The split-
half reliabilities of these scores are .96 and .93, respectively. WAIS devia-
tion IQ's have a mean of 100 and an SD of 15. Hence the standard error
tion of scores in terms of their errors of measurement. An example is, the of the difference between these two scores can be found as follows:
Individual Report Form for use with the Differential Aptit,~~e Tests, re-
produced in Figure 13. On this form, percentile scores ?~ each subtest Udif/. = 15y12 - .96 - .93 = 4.95
of the battery are plotted as one-inch bars, '\\1th the ~1:l~jPed percentil~ ' To determine how large a score difference could be obtained by chance
at the center. Each percentile bar corresponds to a dist~nce of approxI-
at the .05 level, we multiply the standard error of the difference (4.95)
mately 1 Y2 to 2 standard error~ o~ :ithe~' ~,ide ~f 't~i!o~t~ine? ~core.8 by 1.96. The result is 9.70, or approximately 10 points. Thus the differ-
Hence the assumption that the mdlVl~ua! s true ~~~allS Wlthm the ence between an individual's WAIS Verbal and Performance IQ should
bar is correct about 90 percent oftl,t,:.time. In iI'l~,~rp.tetingthe profiles, be at least 10 points to be significant at the .05 level.
test users are advised not to attach Importance to olfferences between
scores whose percentile bars overlap,- especially if they overlap by more
1
than half their length. In the profil~%tl!ustrated~~f~gure 13, for example, RELIABIUTY OF CRITERION-REFERENCED TESTS
·1;-~:. -, ,

8 Because the reliability coefficient (a¥d hence th~ er•• , ••. ) varies somewhat with
subtest, grade, and sex. the actual ranges covered by the one-inch lines are not It will be recalled from Chapter 4 that criterion-referenced, tests usu-
identical, but they are sufficiently close to permit uniform interpretations for practical ally (but not necessarily) evaluate performance in terms o(~ mastery
purposes. rather than degree of achievement. A major statistical implication of
13Z Pl'inciplt:s of Psychological Tcstillg
Reliability 133
mastery testing is a reduction in yariability of scores among persons.
ce~ures ar~ feasible and can reduce total testing time while yielding
Theoretically, if everyone continues training until the skill is mastered, rehable ~stlma.tes of mastery (Glaser & Kitko, 1971).
variability is reduced to zero. Not only is low variability a result of the Some Investigators have been explorinO' the use of Ban'sian estimation
way such tests are used; it is also built into the tests through the con- techniques, whi.eh lend themselves well t~ the kind of decisions required
struction and choice of items, as will be shown in Chapter 8. by, ma~tery testmg. Because of the large number of specific instructional
In an earlier section of this chapter, we saw that any correlation, in- objectives to bc t~sted, criterion· referenced tests typically provide only a
cluding reliability coefficients, is affected by the variability of the group small number of Itcms for cach objective. To supplement this limited in-
in which it is computed. As the vatiability of the sample decreases, so formation, procedures have been developed for incorporatinO' collateral
does the correlation coefficient. Obviously, then, it would be inappropri- data from the student's previous performance history, as well ~s from the
ate to assess the reliahilitv of most criterion-referenced tests by the usual test results of other students (Ferguson & !'\oviek, 197:3; Hambleton &
procedures.o Under thes; conditions, even a highly stable and internally Novick, 1973).
consistent tcst could yield a reliability coefficient near zero. When flexible, individually tailored procedmes are impracticable,
In the construction of criterion-referenced tests, two important ques- I~ore traditional techniques can be utilized to assess the reliability of a
tions are: (1) How many items must be used for reliable assessment of gl\'en .test. For example, mastery decisions reached at a prerequisite in-
each of the specific instructional objectives covered by the test? (2) "What structional level can be che{:ked against performance at the next instruc-
proportion of items must be correct for the reliable establishment of tional level. Is there a sizeable proportion of students who reached or
mastery? In much current testing, these two questions have been an- exceeded the cutoff score on tIle masten' test at .the lower level and
swered by judgmental decisions. Efforts are under way, however, to de- ~ailed t~ achi~\'e mastery at the next levei within a reasonable period of
velop appropriate statistical techniques that will provide objective, em- mstructlOnal tU1W?Does an analysis of their difficulties suggest that they
pirical answers (see, e.g., Ferguson & i\ovick, 1973; Glaser & Nitka, 1971; had not truly mastered the prerequisite skiIIs:l If so, these findings would
Hambleton & l\ovick, 1973; Livingston, 19i2; Millman, 1974). A few strongly suggest that the mastery test was unreliable. Either the addi-
examples will serve to illustrate the nature and scope of these efforts. tion of more items or the establishment of a higher cutoff score would
The t,,'o question~ about number of items and cutoff score can be in- seem to be indicated. Another procedure for determining the reliability
corporated into a single hypothesis, amenable to ~testillg within the of a master)' test is to administer two parallel forms to the same indi-
framework of decision theory and sequential analysis (Glaser & :\'itko, viduals and note the percentage of persons for ",hom the same decision
197]; Lindgren & :'.1cElrath, 1969; Wald, 1947). Specifically, we wish to (~mstery or nonmastery) is reached on both forms (Hambleton & No-
test the hypothesis that the examinee has achieved the required le"el of \'Ick, ] 973 ).
mastery in tllP content domain or instructional objective sampled by tne In the development of several criterion-referenced tests, Educational
test items. Segucntial analysis consists in taking observations one at a Testing Service has followed an empirical procedure to set standards of
timE' and deciding after cach observation whC'f.tper to: (1) accept the mastery. This procedure involves administering the test in classes one
hypothesis; (2) rejcct the hypothes!s; or .(3~f~~ake add~tional o~serYa- grade below and one grade above the grade where the particular con-
tlOns. Thus the number of observations (m.;fhls case :t:lytnber of items) ce?t or skill i~ taught. The dichotomization can be fmther rcGned by
needed to reach a reliable conclusion is, itself deten~nined during the usmg teacher Judgments to exclude any cases in the lower grade knoVl'll
process of testing. Rather than being p.::fls~nted,,:ith a fixed, prede- to have mastered the concept or skill and any cases in the higher grade
termined number of items the examine~~c;;dntimieS;ltaking tbe test until who have demonstrably failed to master it. A cutting score, in terms of
a mastery or nonmastery d~cision is r~·.·" ·'d. At that point, testing is dis- number or percentage of correct items, is then selected that best dis-
cuntinue'd and the student is either dire '.,:fo~he next instructional level criminates between the two groups. .
or returned to the nonmastered level '0 ; further study. \Vith the com- Allstatistical procedures for use with criterion-referenced tests are in
puter facilities described in Chaptn_~, such sequential decision pro- an exploratory stage. Much remains to be done, in both theoretical de-
9 For fuller discussionof special statistic;\"~rocedures required for the construction veloIJ!nent and ~mpir.ical ~ryouts, before the most effective IJlethodology
and evaluationof criterion-referencedtests,see Glaser and Nitko (1971), Hambleton for different testmg situatlons can be formulated. 4

and Novick (1973), Millman (1974), Popham and Husek (1969). A set of tables
for determining the minimum number of ~lems required for establishing mastery at
speCifiedlevels is provided by Millman (1972,1973).
Validity: Basic Concepts 135

sample of the behavior domain to be measured. Such a validation -pro-


HAPTER 6 cedure is commonly used in evaluating achievement tests .. This type of
test is designed to measure how well the individual has mastered a
specific skill or course of study. It might thus appear that mere inspec-

.alidity: tion of the content of the test should suffice to establish its "a1idih' for
such a purpose. A test of multiplication, spelling, or bookkeeping '~'ould
seem to be valid by definition if it consists of multiplication, spelling, or
bookkeeping items, respectively.
.;Basic C011cepts The solution, however, is not so simple as it appears to be.' Onc diffi-
culty is that of adequately sampling the item universe. The behavior do-
main to be tested must be systematically analyzed to make certain that
aJJ major aspects are covered by the test iteme;. and in the correct pro-

·HE VALIDlTY of a test concerns u;lwf the test measures and how
~r example, a test can easily become overloaded with those
aspects of the field that lend thcmselves more readily to the pl'eparation
,
, T wen it does so. In this connection, we should guard against ae-
cepting the test name as an index of .what. the ~est measures. Test
names provide short, convenient labels for IdentificatIon purposes. Most
of objective items. The domain under consideration should be fully de-
scribed in advance, rather than being defined after the test has been pre-
pared. A \VeIl-constructed achievement test should cover the objectives of
test names are far too broad and vague to furnish meaningful clues to the instruction, not just its subject matter. Content must therefore be broadly
behavior area covered, although increasing e£forts are being made to use defined to include major objectives, such as the application of principles
more specific and operationally definable test names. ~he ~rait measured and the interpretation of data,~ as well as factual knowledge. ~vloreover,
by a given test can be defined only through an e~amIna~l~n of. the ob- content validity depends on the relevance of the individual's test re-
jective sources of information and empirical operatIOns ut~li~ed In estab- sponses to the behavior area under consideration, rather than on the
lishing its validity (Anastasi, 1950). Moreover, the vahdlty of ,a .tes; apparent rcle\'ance of item content. Mere inspection of the test may fail
cannot be reported in general terms. No test can be said to ha.ve 'hl~h to reveal the processes actually used by examinces in taking the test.
or "low" validitv in the abstract. Its validity must be determmed WIth It is. also important to guard against any tendency to overgeneralize
reference to the' particular use for, which the test is being considered. regarding the domain sampled by the test. For instance, a multiple-choice
Fundamentallv all procedures for determining test validity are con- spelling test may measure the ability to recognize correctly and incor-
cerned with the 'r~lationships between performance on the test and other rectly spelled worde;. But it cannot be assumed that such a test also
independently observable facts about the behavio~ ehar~cte~stics under measures ability to spell correctly from dictation, frequency of misspell-
consideration. The specific methods ·employed for mvestIgatmg these re- ings in written compositions, and other aspects of spelling ability (Ahl-
lationships are numerous and have been descri~ed by various names. In strom, 1964; Knoell & Harris, 1952). Still another difficulty arises from
the Standards for Educational and PsycJlOloglcal Tests,' (1974), these the possible inclusion of irrelevant factors in the test scores. For example,
procedures are classified under three prineip~~"categories: c~l1t~nt, a test designed to measure proficiency in such areas as mathematics or
criterion-related, and construct validity. Each o~ tnese types of valIdatIon, mechanics may be unduly influenced bv the ability to understand verbal
procedures will be considered in one of the .fgll?c'~ir:!g.section~, and the directions or by speed o{performing si~ple, routi~e tasks.
relations amona them will be examined in,~ .concludmg section. Tech-
niques for analyzing and intcrpreting vali1~tt "data with reference to
practical decisions will be discussed in Chapter 7. SPECIFIC PROCEDURES. Content validity is built into a test from the out-
set through the choice of appropriate' items. For educational tests, the
prepfaration of items is preceded by a thorough and systematic examina-
ti'Qn of relevant course syllabi and textbooks, as well as by consultation

I Further discussions of content validity from several angles ca,n be found in Ebel
NATURE. Content validity involves essentially the systematic exami~a-
(1956), Huddleston (1956), and Lennon (1956). .
tion of the test content to determine whether it covers a representative
Principles of Psychological Testing

with subject-matter experts. On the basis of the information thus gath-


'-ered,test specifications are drawn up for the item writers. These specifi-
cations should show the content areas or topics to be covered, the instruc-
JeqwnN wall ~N'" .••LllCO •.... ..,'" ~::~"'.,.'" "'~~~
--~
•.... "" "'o~
~NN "''''.,.
"INN
Lll"''''0:>0>0
NNN NNM

U!Pn&S IU!~OS
'onal objectives or processes to be tested, and the relative importance of
'ndividual topics and processes. On this basis, the number of items of ~
'u; a3u:a!3S
ach kind to be prepared on each topic can be established. A convenient :>
-,
is~
ay to set up such specifications is in terms of a two-way table, with ~ samuewnH
ocesses across the top and topics in the left-hand column (see Table " " "
,eh. 14). Not all cells in such a table, of course, need to have items, 3A!leJJeN iI
,nee certain processes may be unsuitable or irrelevant for certain topics. I " ""
t might be added that such a specification table will also prove helpful 5Cl'lpn~s le!XlS
. the preparation of teacher-made examinations for classroom use in any '0
"" "
ubject. is eou81OS x
~Jn listing objectives to be co\'ered in an educational achievement test, a"
.• .,f!
'p
" " "" x

u;'C
e test constructor can be guided by the extensive survey of educational :6- S3!l!UllWnH ;

jectives given in the Taxonomy of ~ducational Objectives (Bloom ~


a!., 1956; Krathwohl et al., 1964), Prepared by a group of specialists 'M.!leJJC!N

educational measurement, this handbook also provides examples of


I"
" "
any types of items designed to test each objective. Two volumes are sa!P01S II?POS

ilable, covering cognitive and affective domains, respectively. The "


jor categories given in the cognitive domain include knowledge (in
"
0
'u; a:>Ua!~S x
Iii
.r. "" " " "
sense of remembered facts, terms, methods, principles, etc.), compre- E
0.
5a!)!lJcwnH
sion,application, analysis, synthesis, and evaluation. The classification E
0 " " x
I,)
affective objectives, concerned with the modification of attitudes, in- ,
a,,!~eJJeN
rests, values, and appreciation, includes five major categories: recciv-
" "
'g, responding, yaluing, organization, and characterization. :
IThediscussion of content validity in the manual of an achievement test llj5!l:f%
6 oiIpt'!JE)
CONO
coco", .,. ... .,.
LllMcn "'N~
•.•.. "I•.•..
'" •...•....•....
MLll CO.,.
<0 "'''' •... """-CX) r--.lllCO
Nmq-10l!:t~LO
"''''''' "''''''''
.,.CO'" MNO
<t ••• .,.
"'''I'''
uld include information on th~ content areas and the skills or ob- '"~
.~- ;

ives covered bv the test, with some indication of the number of items £~
ach category. 'In addition, the procedures followed in selecting cate- ~ 0
~Z
'4D!1l%
gaP'!J~! ~ "'.,.-
~~.:rl ""'"
"'''' ..•. •...."'N.,.
.•••.•..
M
•.•.,.to COOOlco "'ON
", ••• N"'.,. ~~filll!'l~;l;
0 ~~g
••..
'"

.....~-
LllNN

, s and classifying items should be described. If subject-matter experts


ipated in the test-construction process, their number and pro- 11l5!1l~
L ape.!)
f....-CD
•.•..<0.,. "I"''''
"'''''''
"'.,.-
"'''I'''
coo",
co •••.
'"
"'Oeo
"'''''''
-~•...
N.,.M "'~Lll
NM.,. N"'<t COOlN "'0>..-
lal qualifications should be stated. If they served as judges in classi- , "l"'''' "--N "INN

items, the directions they were given should be reported, as well as JaqwnN wall -N'" "''''(0 "'COOl O~N
~~~ ~~~ "'O~ "1M.,.
"''''
•.... "''''0
extent of agreement among judges. Because curricula and course ~NN "'NN N"'''' NNM

eilt change over time, it is paI:tJcularly desirable to give the dates


n subject-matter experts were' consulted. Information should like-
be provided about number and nature of 'course syllabi and text-
s surveyed, including publication dates.
umber of empirical procedures may also be followed in order to
ement the content validation of an achievement test. Both total
s and performance on individual items can be checked for grade
ess.In general, those items are retained that show the largest gains
percentages of children passing them from the lower to the upper
Principles of Psychological Testing Validity: Basic Concepts 1.39

es.Figure 14 shows a portion of a table from the manual of the into the initial stages of constructing any test, eventual validation of apti-
ential Tests of Educational Progress-Series II (STEP). For every tude or personality tests requires empirical verification by the procedures
. in each test in this achievement battery, the information provided to be described in the following sections. These tests bear less intrinsic
des its classification with regard to learning skill and type of ma- resemblance to the behavior domain they are trying to sample than do
l,as well as the percentage of children in the normative sample who achievement tests. Consequently, the content of aptitude and personality
the right answer to the item in each of the grades for which that tests can do little more than reveal the hypotheses that led the test con-
of the test is designed. The 30 items included in Figure 14 repre- structor to choose a certain type of content for measuring a specified
t onepart of the Reading test for Level 3, which covers grades 7 to 9. trait. Such hypotheses need to be empirically confirmed to estabiish the
ther supplementary procedures that may be employed, when ap- validity of the test.
priate, include analyses of t~l)es of errors commonly made on a test Unlike achievement tests, aptitude and personality tests are not based
observation of the work methods employed by examinees. The latter on a specified course of instruction or uniform set of prior experiences
ld be done by testing students individually with instructions to "think from which test content can be drawn. Hence, in the latter tests, indi-
ud" while ,solving each problem. The contribution of speed can be viduals are likely to vary more in the work methods or psycholOgical
ckedby noting how many persons fail to finish the test or by one of processes employed in responding to the same test items. The identical
e more refined methods discussed in Chapter 5. To detect the possible test might thus measure different functions in different persons. Under
irrelevantinfluence of ability to read instructions on test performance, these conditions, it would be virtually impossible to determine the psy-
,~res on the test can be ~rrelated \",ith scores on a reading compre- chological functions measured by the tcst from an inspection of its
nsiontest. On the other hand, if the test is designed to measure read- content. For example, college graduates might solve a problem in verbal
g comprehension, giving the questions v.oithout the reading passage on or mathematical terms, while a,mechanic would arrive at the same solu-
hich they are based will show how many could be answered simply tion in terms of spatial visualization. Or a test measuring arithmetic
fromthe examinees' prior information or other irrelevant cues. reasoning among high scho.ol freshmen might measure only individual
differences in speed of computation when given to college" students. A
specific illustration of the dangers of relying on content analysis of apti-
APPLICATIONS. Especially when bolstered by such empirical checks as tude tests is provided by a study conducted with a digit-symbol substitu-
thoseilIusb'ated above, content vali,dity provides an adequate technique tion ~est"(Burik, 1950). This test, generally regarded as a typical "code-
forevaluating achievement tests. It permits us to answer two questions learmng test, was found to measure chiefly motor speed in a group of
ihat are basic to the validitv of an achievement test; (1) Does the test high school students.
'cover a representative sa~ple of the speCified skills and knowledge?
(2) Is test performance reasonably free from the influence of irrelevant
; \Janables? FACE "ALIDITY. Content validitv should not be confused with face va-
~. Content validity is particularly appropriate for the criterion-refer~n~d lidity. The latter is not validity 'in the technical sense; it refers, not to
.. testsdescribed in Chapter 4. Because performance on these tests lS 111- what the test actually measures, but to what it appears superficially to
f .terpreted in tern1S of content meaning, it is obvious that content validity measure. Face validity pertains to whether the test "looks valid" to the
~ is a prime requiremenf for their effective use. Content validation is also examinees who take it, the administrative personnel who decide on its
· applicable to certain occupational tests designed for employee selection use, and other technically untrained observers. Fundamentally, the ques-
and classification, to .be discussed in Chapter 15. This type of validation tion of face validity concerns rapport and public relations. Although
issuitable when the test is an actual job sample or otherwise calls for the common usage of the term validity in tlhs connection may make for
sameskills and knowledge required on the job. In such cases, a thorough confusion, face validity itself is a desirable feature of tests. For example,
· job analysis should be carried out in order to demonstrate the close re- when tests originally designed for children and developed within a class-
· semblance between the job activities and the test. room setting were grst extended for adult use, they frequently met with
For aptitude and personality tests, on the other hand, content validity ~esistance and criticism because of their lack of face validity. Certainly
is usually inappropriate and may, in fact, be misleading. Although con- if test content appears irrelevant, inappropriate, silly, or childish, the
siderations of relevance and effectiveness of content must obviously enter result will be poor cooperation, regardless of the actual validity of the
140 Principles of Psychological Testing Validity: Basic Concepts 141

~st.Especially in adult testing, it is not sufficie~t for a t~st to. be ob- sonnel to occupational training programs represent examples of the sort
ctivelyvalid. It also needs face validity to function effectively In prac- of decisions requiring a knowledge of the predictive validity of tests.
oal situations. Other examples include the use of tests to screen out applicants likely
.Face validity can often be improved by merely reformulating test to develop emotional disorders in stressful environments and the use of
msin terms that appear relevant and plausible in the particular setting tests to identify psychiatric patients most likely to benefit from a par-
whichthe" will be used. For example, if a test of simple arithmetic ticular therapy.
soningis 'constructed for use with machinists, the items should be In .a number of instances, concurrent validity is found merely as a
ded in tcrms of machine operations rather than in terms of "how su~stJt~te for predictive validity. It is frequently impracticable to extend
y oranges can be purchased for 36 cents" or other traditional school- vah~atlon ~rocedures over the time required for predictive validity or to
k problems. Similarly, an arithmetic test for naval personnel can be o~tam a s~Itable preselection sample for testing purposes. As a compro-
ressedin naval terminology, without necessarily altering the functions m~se .solutIOn, therefore, tests are administered to a group on whom
asured.To be sure, face validity should never be regarded as a substi- cntenon data are already available. Thus, the test scores of college
e for objectively determined validity. It cannot be assumed that im- stud~nts may b~ compared with their cumulative grade-point average at
\1ng the face validity of a test '\vill improve its objective validity. ~he tIme of testmg, or those of employees compared with their current
r can it be assumed that when a test is modified so as to increase its Job success.
e validity,its objective validity remains unaltered. The validity of the For certain uses of psychological tests, on the other hand, concurrent
in its final form will always need to be directly checked. validity ~sthe ~~st ~pprop!iate type and can be justified in its own right.
The logICal dI~tinchon between predictive and concurrent validity is
?ased, not on hme, but on the objectives of testing. Concurrent validity
ISrel.ev~nt to tests employed for diagnosis of existing status, rather than
predIction of future outcomes. The difference can be illustrated bv ask-
riterion-relatedvalidity indicates the effectiveness of a test in predict- ing: "Is Smith neurotic?" (concurrent validity) and "Is Smith lik"ely to
an individual's beha\'ior in specified situations. For t~is purpose, per- become neurotic::>"(predictive validity) . .
anceon the test is checked against a criterion, i.e., a direct and in- . Because ~he criterion for concurrent validity is always available at the
dent measure of that which the test is deSigned to predict. Thus, hme of testmg, we might ask what function is served bv the test in such
mechanical aptitude test, the criterion might bc subsequent job situa~ions. B~sicalIy, such tests provide a simpler, quicker, or less ex-
ormanceas a machinist; for a scholastic aptitude test, it might be ~ensive subs.htute for the criterion data. For example, if the criterion con-
ge grades; and for a neuroticism test, it might be associates' ratings SIStsof continuous observation of a patient during a two-week hospital- ,
..her available information on the subjects' behavior in various life ization period, a test that could sort out normals from neurotic and '
lions. ?oubtful cases would appreciably reduce the number of persons requir-
mg such extensive observation.

'CURREI'.:T AND PREDICTIVE VALIDITY. The criterion measure against


test scores are validated may be obtained at approximately the • ~RITERION CO~TAMINATION. An essential precaution in finding the va-
. time as the test scores or after a stated interval. The APA test hdlty of a test IS to make certain that the test scores do not themselves
·urds (1974), differentiate between concurrent and predictive valid- influence any individ~ars c~terion. status. For example, if a college ill<-
the basis of these time relations between criterion and test. The st.metor or a foreman III an mdustnal plant knows that a particularillai~
rediction"can be used in the broader sense, to refer to prediction VIdual scored very p~rly on an aptitude test, such lcIl,owl~qgemight in-
he test to any criterion situation, or in the more limited sense of fluence the gr~de gIVen to the student or the rating assigned to the
'on over a time interval. It is in the latter sense that it is used in worker. Or a hIgh-scoring person might be given the benefit of the doubt
ression"'predictive validity:' The information provided by pre- ~hen academic grades or on-the-job ratings are being prepared. Such
validityis most relevant to tests used in the selection and das- mHuences would obviously raise the correlation between test scores and
n of personnel. Hiring job applicants, selecting students for crite~on in ~ manner that is entirely spurious' or <ilrtificia1:;. .'
onto college or professional schools, and assigning military per- TIus pOSSIblesource of error in test validation is known as criterion
Validity: Basic Concepts 143
rillciplesof Psychological Testing
selected group than elementary school graduates, the relation between
tion, since the criterion ratings become "contaminated" by the amount of education and scholastic a titnde is far from erEect. Espe-
owledgeof the test scores. To prevent the operation of such an cIa y at t e Ig er e ucationallevels, economic, social, motivational, and
s absolutely essential that no person who participates in the as- other nonintellectual factors may influence the continuation of the indi-
of criterion ratings have any knowledge of the examinees' test vidual's education. Moreover, with such concurrent validation it is diffi-
or this reason, test scores employed in "testing the test" must cult to disentangle cause-and-effect relations. To what extent ~re the ob-
rictlyconfidential. It is sometimes difficult to convince teachers, tain~d differences in intelligence test scores simply the result of the
s, military officers, and other line personnel that such a precau- yarymg amount of education? And to what extent could the test have
ential. In their urgency to utilize all available information for predicted individual differences in subsequent educational progress?
decisions,such persons may fail to realize that the test scores These questions can be answered only when the test is administered be-
e put aside until the criterion data mature and validity can be fore the criterion data have matured, as in predictive validation.
d, I.n t~e development of special aptitude tests, a frequent type of cri-
teno~ is bas~d on performance in specialized training. For example, me-
chamcal aptitude tests may be validated against final achievement in
MON CRiTERIA. Any test may be validated against as many criteria sho~ courses. Various business school courses, such as stenographY,
e are specific uses for it. Any method for assessing behavior in t~l~g, or bookkeeping, provide criteria for aptitude tests in these area's.
tion could provide a criterion measure for some particular pur- SlIl~Ilarly,p~rformance in music or art schools has been employed in vali-
he criteria employed in £ndif\g the validities reported in test datmg musIc. or art. aptitude tests. Several professional aptitude tests
Is,however, fall into a few common categories. Among the criteria have been validated In terms of achievement in schools of law medicine
equendyemployed in validating intelligence test~ is some index of dentistry, engineering, and oth;r areas. In the case of custom-:nade tests'
ic ac ' t. It is for this reason that such tests have often deSigned for use within a specific testing program, training reco;ds are ~
ore precisely described as measures of scholastic aptitude,. The f:equent ~ource of ~riterion data. An outstanding illustration is the valida-
cindicesused as criterion measures include school grades, achieve- hO~ ~f Au Force pIlot selection tests against performance in basic flight
est scores, promotion and graduation records, special honors and tr~m~g. Performance in training programs is also commonly used as a
as, and teachers' or instructors' ratings for "intelligence." Insofar as
~ntenon ~or test validation in other military occupational specialties and
ratings given within an acade~ic setting are likely to be heavily m some mdustrial validation studies.
~dby the individual's scholastic performance, they may be properly ~mong the specific indices of training performance employed for cri-
ed with the criterion of academic achievement. tenon purposes may be mentioned achievement tests administered on
e various indices of academic achievement have provided criterion
.completion of training, formally assigned grades, instructors' ratings, and
at all educational levels, from the primary grades to college and succ~ssful co~pletjon of. training versus elimination from the program.
uateschool. Although employed principally in the validation of gen- l\ful~lple .aptItude battenes have often been checked against grades in
intelligence tests, they have also served as criteria for certain spec,IRehIg? school or college courses, in order to determine their validity
'pIe-aptitude and personahty tests. In the validation of any of these as dIfferential predictors. For example, scores on a verbal comprehension
s.oftests for use in the selection of college students, for example, a test may be compared with grades in English courses spatial visualiza-
:~on criterion is freshman grade-point average. This measure is the tion scores with geometry grades, and so forth. '
.age grade in all courses taken during the freshman year, each grade In connection with the use of training records in general as criterion
g weighted by the number of course points for which it w.a~~ceived. measures, a useful distinction is that between intermediate and ultimate
A variant of the criterion of academic achievement frequently employed with out-of-school adults is the amount of education the individual completed. It is expected that in general the more intelligent individuals continue their education longer, while the less intelligent drop out of school earlier. The assumption underlying this criterion is that the educational ladder serves as a progressively selective influence, eliminating those incapable of continuing beyond each step. Although it is undoubtedly true that college graduates, for example, represent a more highly selected group than elementary school graduates, the relation between amount of education and scholastic aptitude is far from perfect. Especially at the higher educational levels, economic, social, motivational, and other nonintellectual factors may influence the continuation of the individual's education. Moreover, with such concurrent validation it is difficult to disentangle cause-and-effect relations. To what extent are the obtained differences in intelligence test scores simply the result of the varying amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured, as in predictive validation.
In the development of special aptitude tests, a frequent type of criterion is based on performance in specialized training. For example, mechanical aptitude tests may be validated against final achievement in shop courses. Various business school courses, such as stenography, typing, or bookkeeping, provide criteria for aptitude tests in these areas. Similarly, performance in music or art schools has been employed in validating music or art aptitude tests. Several professional aptitude tests have been validated in terms of achievement in schools of law, medicine, dentistry, engineering, and other areas. In the case of custom-made tests designed for use within a specific testing program, training records are a frequent source of criterion data. An outstanding illustration is the validation of Air Force pilot selection tests against performance in basic flight training. Performance in training programs is also commonly used as a criterion for test validation in other military occupational specialties and in some industrial validation studies.

Among the specific indices of training performance employed for criterion purposes may be mentioned achievement tests administered on completion of training, formally assigned grades, instructors' ratings, and successful completion of training versus elimination from the program. Multiple aptitude batteries have often been checked against grades in specific high school or college courses, in order to determine their validity as differential predictors. For example, scores on a verbal comprehension test may be compared with grades in English courses, spatial visualization scores with geometry grades, and so forth.

In connection with the use of training records in general as criterion measures, a useful distinction is that between intermediate and ultimate criteria. In the development of an Air Force pilot-selection test or a medical aptitude test, for example, the ultimate criteria would be combat performance and eventual achievement as a practicing physician, respectively. Obviously it would require a long time for such criterion data to mature. It is doubtful, moreover, whether a truly ultimate criterion is ever obtained in actual practice. Finally, even were such an ultimate criterion available, it would probably be subject to many uncontrolled factors that would render it relatively useless. For example, it would be difficult to evaluate the relative degree of success of physicians practicing in different specialties and in different parts of the country. For these reasons, such intermediate criteria as performance records at some stage of training are frequently employed as criterion measures.

For many purposes, the most satisfactory type of criterion measure is that based on follow-up records of actual job performance. This criterion has been used to some extent in the validation of general intelligence as well as personality tests, and to a larger extent in the validation of special aptitude tests. It is a common criterion in the validation of custom-made tests for specific jobs. The "jobs" in question may vary widely in both level and kind, including work in business, industry, the professions, and the armed services. Most measures of job performance, although probably not representing ultimate criteria, at least provide good intermediate criteria for many testing purposes. In this respect they are to be preferred to training records. On the other hand, the measurement of job performance does not permit as much uniformity of conditions as is possible during training. Moreover, since it usually involves a longer follow-up, the criterion of job performance is likely to entail a loss in the number of available subjects. Because of the variation in the nature of nominally similar jobs in different organizations, test manuals reporting validity data against job criteria should describe not only the specific criterion measures employed but also the job duties performed by the workers.

Validation by the method of contrasted groups generally involves a composite criterion that reflects the cumulative and uncontrolled selective influences of everyday life. This criterion is ultimately based on survival within a particular group versus elimination therefrom. For example, in the validation of an intelligence test, the scores obtained by institutionalized mentally retarded children may be compared with those obtained by schoolchildren of the same age. In this case, the multiplicity of factors determining commitment to an institution for the mentally retarded constitutes the criterion. Similarly, the validity of a musical aptitude or a mechanical aptitude test may be checked by comparing the scores obtained by students enrolled in a music school or an engineering school, respectively, with the scores of unselected high school or college students.

To be sure, contrasted groups can be selected on the basis of any criterion, such as school grades, ratings, or job performance, by simply choosing the extremes of the distribution of criterion measures. The contrasted groups included in the present category, however, are distinct groups that have gradually become differentiated through the operation of the multiple demands of daily living. The criterion under consideration is thus more complex and less clearly definable than those previously discussed.
The method of contrasted groups is used quite commonly in the validation of personality tests. Thus, in validating a test of social traits, the test performance of salesmen or executives, on the one hand, may be compared with that of clerks or engineers, on the other. The assumption underlying such a procedure is that, with reference to many social traits, individuals who have entered and remained in such occupations as selling or executive work will as a group excel persons in such fields as clerical work or engineering. Similarly, college students who have engaged in many extracurricular activities may be compared with those who have participated in none during a comparable period of college attendance. Occupational groups have frequently been used in the development and validation of interest tests, such as the Strong Vocational Interest Blank, as well as in the preparation of attitude scales. Other groups sometimes employed in the validation of attitude scales include political, religious, geographical, or other special groups generally known to represent distinctly different points of view on certain issues.
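Expressed in contemporary computational form, the comparison underlying the contrasted-group method can be sketched as follows. The scores below are invented for illustration, and the point-biserial correlation is offered only as one convenient index of how sharply a test separates two naturally differentiated groups.

# Contrasted-group comparison: scores of an institutionalized group versus
# schoolchildren of the same age. All score values are hypothetical.
import numpy as np

institution = np.array([62, 70, 58, 66, 74, 61, 69])       # hypothetical scores
schoolchildren = np.array([95, 102, 88, 110, 99, 105, 92])  # hypothetical scores

scores = np.concatenate([institution, schoolchildren])
membership = np.concatenate([np.zeros(len(institution)),
                             np.ones(len(schoolchildren))])

# Point-biserial correlation between group membership and test score.
r_pb = np.corrcoef(membership, scores)[0, 1]
print(f"group means: {institution.mean():.1f} vs {schoolchildren.mean():.1f}")
print(f"point-biserial r = {r_pb:.2f}")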
In the development of certain personality tests, psychiatric diagnosis is used both as a basis for the selection of items and as evidence of test validity. Psychiatric diagnosis may serve as a satisfactory criterion provided that it is based on prolonged observation and detailed case history, rather than on a cursory psychiatric interview or examination. In the latter case, there is no reason to expect the psychiatric diagnosis to be superior to the test score itself as an indication of the individual's emotional condition. Such a psychiatric diagnosis could not be regarded as a criterion measure, but rather as an indicator or predictor whose own validity would have to be determined.

Mention has already been made, in connection with other criterion categories, of certain types of ratings by school teachers, instructors in specialized courses, and job supervisors. To these can be added ratings by officers in military situations, ratings of students by school counselors, and ratings by co-workers, classmates, fellow club-members, and other groups of associates. The ratings discussed earlier represented merely a subsidiary technique for obtaining information regarding such criteria as academic achievement, performance in specialized training, or job success. We are now considering the use of ratings as the very core of the criterion measure. Under these circumstances, the ratings themselves define the criterion. Moreover, such ratings are not restricted to the evaluation of specific achievement, but involve a personal judgment by an observer regarding any of the variety of traits that psychological tests attempt to measure. Thus, the subjects in the validation sample might be rated on such characteristics as dominance, mechanical ingenuity, originality, leadership, or honesty.

Ratings have been employed in the validation of almost every type of test. They are particularly useful in providing criteria for personality tests, since objective criteria are much more difficult to find in this area. This is especially true of distinctly social traits, in which ratings based on personal contact may constitute the most logically defensible criterion. Although ratings may be subject to many judgmental errors, when obtained under carefully controlled conditions they represent a valuable source of criterion data. Techniques for improving the accuracy of ratings and for reducing common types of errors will be considered in Chapter 20.

Finally, correlations between a new test and previously available tests are frequently cited as evidence of validity. When the new test is an abbreviated or simplified form of a currently available test, the latter can properly be regarded as a criterion measure. Thus, a paper-and-pencil test might be validated against a more elaborate and time-consuming performance test whose validity had previously been established. Or a group test might be validated against an individual test. The Stanford-Binet, for example, has repeatedly served as a criterion in validating group tests. In such a case, the new test may be regarded at best as a crude approximation of the earlier one. It should be noted that unless the new test represents a simpler or shorter substitute for the earlier test, the use of the latter as a criterion is indefensible.

SPECIFICITY OF CRITERIA. Criterion-related validity is most appropriate for local validation studies, in which the effectiveness of a test for a specific program is to be assessed. This is the approach followed, for example, when a given company wishes to evaluate a test for selecting applicants for one of its jobs or when a given college wishes to determine how well an academic aptitude test can predict the course performance of its students. Criterion-related validity can be best characterized as the practical validity of a test in a specified situation. This type of validation represents applied research, as distinguished from basic research, and as such it provides results that are less generalizable than the results of other procedures.

FIG. 15. Examples of Variation in Validity Coefficients of Given Tests for Particular Jobs. (Adapted from Ghiselli, 1966, p. 29.) [Two histograms on a scale of validity coefficients from -1.00 to +1.00: 72 coefficients for general clerks on intelligence tests, proficiency criteria; 191 coefficients for benchworkers on finger dexterity tests, proficiency criteria.]
That criterion-related validity may be quite specific has been demonstrated repeatedly. Figure 15 gives examples of the wide variation in the correlations of a single type of test with criteria of job proficiency. The first graph shows the distribution of 72 correlations found between intelligence test scores and measures of the job proficiency of general clerks; the second graph summarizes in similar fashion 191 correlations between finger dexterity tests and the job proficiency of benchworkers. Although in both instances the correlations tend to cluster in a particular range of validity, the variation among individual studies is considerable. The validity coefficient may be high and positive in one study and negligible or even substantially negative in another.

Similar variation with regard to the prediction of course grades is illustrated in Figure 16. This figure shows the distribution of correlations obtained between grades in mathematics and scores on each of the subtests of the Differential Aptitude Tests. Thus, for the Numerical Ability test (NA), the largest number of validity coefficients among boys fell between .50 and .59; but the correlations obtained in different mathematics courses and in different schools ranged from .22 to .75. Equally wide differences were found with the other subtests and, it might be added, with grades in other subjects not included in Figure 16.
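A validity coefficient of the kind summarized in Figures 15 and 16 is simply the Pearson correlation between test scores and a criterion measure. A minimal computational sketch follows; the subtest scores and grades are invented for illustration.

# Validity coefficients of two subtests against the same criterion
# (mathematics grades). All values are hypothetical.
import numpy as np

numerical_ability = np.array([32, 45, 28, 51, 39, 47, 35, 42])  # subtest scores
verbal_reasoning  = np.array([40, 38, 25, 48, 33, 50, 30, 44])
math_grades       = np.array([74, 85, 70, 92, 80, 88, 72, 83])  # criterion

for name, test in [("NA", numerical_ability), ("VR", verbal_reasoning)]:
    r = np.corrcoef(test, math_grades)[0, 1]
    print(f"validity of {name} against mathematics grades: r = {r:.2f}")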
Some of the variation in validity coefficients against job criteria reported in Figure 15 results from differences among the specific tests employed in different studies to measure intelligence or finger dexterity. In the results of both Figures 15 and 16, moreover, some variation is attributable to differences in homogeneity and level of the groups tested. The range of validity coefficients found, however, is far wider than could be explained in these terms. Differences in the criteria themselves are undoubtedly a major reason for the variation observed among validity coefficients. Thus, the duties of office clerks or benchworkers may differ widely among companies or among departments in the same company. Similarly, courses in the same subject may differ in content, teaching method, instructor characteristics, bases for evaluating student achievement, and numerous other ways. Consequently, what appears to be the same criterion may represent very different combinations of traits in different situations.

Criteria may also vary over time in the same situation. For example, the validity coefficient of a test against job training criteria often differs from its validity against job performance criteria (Ghiselli, 1966). There is evidence that the traits required for successful performance of a given job or even a single task vary with the amount of job experience of the individual (Fleishman & Fruchter, 1960; Ghiselli & Haire, 1960). There is also evidence that criteria may change over time for other reasons, such as the changing nature of jobs, shifts within organizations, individual advancement in rank, and other temporal conditions (MacKinney, 1967; Prien, 1966). It is well known, of course, that educational curricula and course content change over time. In other words, the criteria most commonly used in validating intelligence and aptitude tests, namely job performance and educational achievement, are dynamic rather than static. It follows that criterion-related validity is itself subject to temporal changes.

FIG. 16. Graphic Summary of Validity Coefficients of the Differential Aptitude Tests (Forms S and T) for Course Grades in Mathematics. The accompanying numbers in each column indicate the number of coefficients in the range given at the left. (From Fifth Edition Manual, p. 82. Reproduced by permission. Copyright © 1975 by The Psychological Corporation, New York, N.Y. All rights reserved.)

SYNTHETIC VALIDITY. Criteria not only differ across situations and over time, but they are also likely to be complex (see, e.g., Richards, Taylor, Price, & Jacobsen, 1965). Success on a job, in school, or in other activities of daily life depends not on one trait but on many traits. Hence, practical criteria are likely to be multifaceted. Several different indicators or measures of job proficiency or academic achievement could thus be used in validating a test. Since these measures may tap different traits or combinations of traits, it is not surprising to find that they yield different validity coefficients for any given test. When different criterion measures are obtained for the same individuals, their intercorrelations are often quite low. For instance, accident records or absenteeism may show virtually no relation to productivity or error data for the same job (Seashore, Indik, & Georgopoulos, 1960). These differences, of course, are reflected in the validity coefficients of any given test against different criterion measures. Thus, a test may fail to correlate significantly with supervisors' ratings of job proficiency and yet show appreciable validity in predicting who will resign and who will be promoted at a later date (Albright, Smith, & Glennon, 1959).

Because of criterion complexity, validating a test against a composite criterion of job proficiency, academic achievement, or other similar accomplishments may be of questionable value and is certainly of limited generality. If different subcriteria are relatively independent, a more effective procedure is to validate each test against that aspect of the criterion it is best designed to measure. An analysis of these more specific relationships lends meaning to the test scores in terms of the multiple dimensions of criterion behavior (Dunnette, 1963; Ebel, 1961; S. R. Wallace, 1965). For example, one test might prove to be a valid predictor of a clerk's perceptual speed and accuracy in handling detail work, another of his ability to spell correctly, and still another of his ability to resist distraction.

If, now, we return to the practical question of evaluating a test or combination of tests for effectiveness in predicting a complex criterion such as success on a given job, we are faced with the necessity of conducting a separate validation study in each local situation and repeating it at frequent intervals. This is admittedly a desirable procedure and one that is often recommended in test manuals. In many situations, however, it is not feasible to follow this procedure because of well-nigh insurmountable practical obstacles. Even if adequately trained personnel are available to carry out the necessary research, most criterion-related validity studies conducted in industry are likely to prove unsatisfactory for at least three reasons. First, it is difficult to obtain dependable and sufficiently comprehensive criterion data. Second, the number of employees engaged in the same or closely similar jobs within a company is often too small for significant statistical results. Third, correlations will very probably be lowered by restriction of range through preselection, since only those persons actually hired can be followed up on the job.
For all the reasons discussed above, personnel psychologists have shown increasing interest in a technique known as synthetic validity. First introduced by Lawshe (1952), the concept of synthetic validity has been defined by Balma (1959, p. 395) as "the inferring of validity in a specific situation from a systematic analysis of job elements, a determination of test validity for these elements, and a combination of elemental validities into a whole." Several procedures have been developed for gathering the needed empirical data and for combining these data to obtain an estimate of synthetic validity for a particular complex criterion (see, e.g., Guion, 1965; Lawshe & Balma, 1966, Ch. 14; McCormick, 1959; Primoff, 1959, 1975). Essentially, the process involves three steps: (1) detailed job analysis to identify the job elements and their relative weights; (2) analysis and empirical study of each test to determine the extent to which it measures proficiency in performing each of these job elements; and (3) finding the validity of each test for the given job synthetically from the weights of these elements in the job and in the test.

In a long-term research program conducted with U.S. Civil Service job applicants, Primoff (1975) has developed the J-coefficient (for "job-coefficient") as an index of synthetic validity. Among the special features of this procedure are the listing of job elements expressed in terms of worker behavior and the rating of the relative importance of these elements in each job by supervisors and job incumbents. Correlations between test scores and self-ratings on job elements are found in total applicant samples (not subject to the preselection of employed workers). Various checking procedures are followed to ensure stability of correlations and weights derived from self-ratings, as well as adequacy of criterion coverage. For these purposes, data are obtained from different samples of applicant populations. The final estimate of correlation between test and job performance is found from the correlation of each job element with the particular job and the weight of the same element in the given test.¹ There is evidence that the J-coefficient has proved helpful in improving the employment opportunities of minority applicants and persons with little formal education, because of its concentration on job-relevant skills (Primoff, 1975).

A different application of synthetic validity, especially suitable for use in a small company with few employees in each type of job, is described by Guion (1965). The study was carried out in a company having 48 employees, each of whom was doing a job that was appreciably different from the jobs of the other employees. Detailed job analyses nevertheless revealed seven job elements common to many jobs. Each employee was rated on the job elements appropriate to his job; and these ratings were then checked against the employees' scores on each test in a trial battery. On the basis of these analyses, a separate battery could be "synthesized" for each job by combining the two best tests for each of the job elements demanded by that job. When the batteries thus assembled were applied to a subsequently hired group of 13 employees, the results showed considerable promise. Because of the small number of cases, these results are only suggestive. The study was conducted primarily to demonstrate a model for the utilization of synthetic validity.

The two examples of synthetic validity were cited only to illustrate the scope of possible applications of these techniques. For a description of the actual procedures followed, the reader is referred to the original sources. In summary, the concept of synthetic validity can be implemented in different ways to fit the practical exigencies of different situations. It offers a promising approach to the problem of complex and changing criteria; and it permits the assembling of test batteries to fit the requirements of specific jobs and the determination of test validity in many contexts where adequate criterion-related validation studies are impracticable.

¹ The statistical procedures are essentially an adaptation of multiple regression equations, to be discussed in Chapter 7. For each job element, its correlation with the job is multiplied by its weight in the test, and these products are added across all appropriate job elements.
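The combination rule stated in the footnote can be sketched directly in computational form. The job elements, correlations, and weights below are entirely hypothetical; an actual J-coefficient study would derive them from job analyses and applicant-sample data as described above.

# J-coefficient combination rule: for each job element, multiply its
# correlation with the job by its weight in the test, then sum the products.
def j_coefficient(element_job_correlations, element_test_weights):
    return sum(element_job_correlations[e] * element_test_weights[e]
               for e in element_job_correlations if e in element_test_weights)

element_job_correlations = {   # importance of each element for this job (hypothetical)
    "checking accuracy": .60,
    "arithmetic computation": .40,
    "following written directions": .30,
}
element_test_weights = {       # how heavily the test samples each element (hypothetical)
    "checking accuracy": .50,
    "arithmetic computation": .35,
    "following written directions": .20,
}
print(f"{j_coefficient(element_job_correlations, element_test_weights):.2f}")
# prints 0.50 (= .60 x .50 + .40 x .35 + .30 x .20)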
CONSTRUCT VALIDITY

The construct validity of a test is the extent to which the test may be said to measure a theoretical construct or trait. Examples of such constructs are intelligence, mechanical comprehension, verbal fluency, speed of walking, neuroticism, and anxiety. Focusing on a broader, more enduring, and more abstract kind of behavioral description than the previously discussed types of validity, construct validation requires the gradual accumulation of information from a variety of sources. Any data throwing light on the nature of the trait under consideration and the conditions affecting its development and manifestations are grist for this validity mill. Illustrations of specific techniques suitable for construct validation will be considered below.

DEVELOPMENTAL CHANGES. A major criterion employed in the validation of a number of intelligence tests is age differentiation. Such tests as the Stanford-Binet and most preschool tests are checked against chronological age to determine whether the scores show a progressive increase with advancing age. Since abilities are expected to increase with age during childhood, it is argued that the test scores should likewise show such an increase, if the test is valid. The very concept of an age scale of intelligence, as initiated by Binet, is based on the assumption that "intelligence" increases with age, at least until maturity.

The criterion of age differentiation, of course, is inapplicable to any functions that do not exhibit clear-cut and consistent age changes. In the area of personality measurement, for example, it has found limited use. Moreover, it should be noted that, even when applicable, age differentiation is a necessary but not a sufficient condition for validity. Thus, if the test scores fail to improve with age, such a finding probably indicates that the test is not a valid measure of the abilities it was designed to sample. On the other hand, to prove that a test measures something that increases with age does not define the area covered by the test very precisely. A measure of height or weight would also show regular age increments, although it would obviously not be designated as an intelligence test.

A final point should be emphasized regarding the interpretation of the age criterion. A psychological test validated against such a criterion measures behavior characteristics that increase with age under the conditions existing in the type of environment in which the test was standardized. Because different cultures may stimulate and foster the development of dissimilar behavior characteristics, it cannot be assumed that the criterion of age differentiation is a universal one. Like all other criteria, it is circumscribed by the particular cultural setting in which it is derived.

Developmental analyses are also basic to the construct validation of the Piagetian ordinal scales cited in Chapter 4. A fundamental assumption of such scales is the sequential patterning of development, such that the attainment of earlier stages in concept development is prerequisite to the acquisition of later conceptual skills. There is thus an intrinsic hierarchy in the content of these scales. The construct validation of ordinal scales should therefore include empirical data on the sequential invariance of the successive steps. This involves checking the performance of children at different levels in the development of any tested concept, such as conservation or object permanence. Do children who demonstrate mastery of the concept at a given level also exhibit mastery at the lower levels? Insofar as criterion-referenced tests are also frequently designed according to a hierarchical pattern of learned skills, they, too, can utilize empirical evidence of hierarchical invariance in their validation.

CORRELATIONS WITH OTHER TESTS. Correlations between a new test and similar earlier tests are sometimes cited as evidence that the new test measures approximately the same general area of behavior as other tests designated by the same name, such as "intelligence tests" or "mechanical aptitude tests." Unlike the correlations found in criterion-related validity, these correlations should be moderately high, but not too high. If the new test correlates too highly with an already available test, without such added advantages as brevity or ease of administration, then the new test represents needless duplication.

Correlations with other tests are employed in still another way to demonstrate that the new test is relatively free from the influence of certain irrelevant factors. For example, a special aptitude test or a personality test should have a negligible correlation with tests of general intelligence or scholastic aptitude. Similarly, reading comprehension should not appreciably affect performance on such tests. Thus, correlations with tests of general intelligence, reading, or verbal comprehension are sometimes reported as indirect or negative evidence of validity. In these cases, high correlations would make the test suspect. Low correlations, however, would not in themselves insure validity. It will be noted that this use of correlations with other tests is similar to one of the supplementary techniques described under content validity.

FACTOR ANALYSIS. Of particular relevance to construct validity is factor analysis, a statistical procedure for the identification of psychological traits. Essentially, factor analysis is a refined technique for analyzing the interrelationships of behavior data. For example, if 20 tests have been given to 300 persons, the first step is to compute the correlations of each test with every other. An inspection of the resulting table of 190 correlations may itself reveal certain clusters among the tests, suggesting the location of common traits. Thus, if such tests as vocabulary, analogies, opposites, and sentence completion have high correlations with each other and low correlations with all other tests, we could tentatively infer the presence of a verbal comprehension factor. Because such an inspectional analysis of a correlation table is difficult and uncertain, however, more precise statistical techniques have been developed to locate the common factors required to account for the obtained correlations. These techniques of factor analysis will be examined further in Chapter 13, together with multiple aptitude tests developed by means of factor analysis.
In the process of factor analysis, the number of variables or categories in terms of which each individual's performance can be described is reduced from the number of original tests to a relatively small number of factors, or common traits. In the example cited above, five or six factors might suffice to account for the intercorrelations among the 20 tests. Each individual might thus be described in terms of his scores in the five or six factors, rather than in terms of the original 20 scores. A major purpose of factor analysis is to simplify the description of behavior by reducing the number of categories from an initial multiplicity of test variables to a few common factors, or traits.

After the factors have been identified, they can be utilized in describing the factorial composition of a test. Each test can thus be characterized in terms of the major factors determining its scores, together with the weight or loading of each factor and the correlation of the test with each factor. Such a correlation is known as the factorial validity of the test. Thus, if the verbal comprehension factor has a weight of .66 in a vocabulary test, the factorial validity of this vocabulary test as a measure of the trait of verbal comprehension is .66. It should be noted that factorial validity is essentially the correlation of the test with whatever is common to a group of tests or other indices of behavior. The set of variables analyzed can, of course, include both test and nontest data. Ratings and other criterion measures can thus be utilized, along with other tests, to explore the factorial validity of a particular test and to define the common traits it measures.
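The relation between factor loadings and factorial validity can be illustrated computationally. In the sketch below, the first principal component of the correlation matrix stands in for the common factor; this is only a rough approximation to a formal factor analysis, and the test scores are simulated rather than drawn from any real battery.

# Factorial validity sketched numerically: the loading of each test on a
# common factor is its correlation with that factor. Scores are simulated.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=300)                       # latent verbal factor (simulated)
vocabulary = .8 * g + .6 * rng.normal(size=300)
analogies  = .7 * g + .7 * rng.normal(size=300)
opposites  = .6 * g + .8 * rng.normal(size=300)

scores = np.vstack([vocabulary, analogies, opposites])
corr = np.corrcoef(scores)                     # 3 x 3 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)
first = eigenvectors[:, -1]                    # component with largest eigenvalue
loadings = first * np.sqrt(eigenvalues[-1])    # correlations with the "factor"

for name, loading in zip(["vocabulary", "analogies", "opposites"],
                         np.abs(loadings)):
    print(f"factorial validity of {name}: {loading:.2f}")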
INTERNAL CONSISTENCY. In the published descriptions of certain tests, especially in the area of personality, the statement is made that the test has been validated by the method of internal consistency. The essential characteristic of this method is that the criterion is none other than the total score on the test itself. Sometimes an adaptation of the contrasted group method is used, extreme groups being selected on the basis of the total test score. The performance of the upper criterion group on each test item is then compared with that of the lower criterion group. Items that fail to show a significantly greater proportion of "passes" in the upper than in the lower criterion group are considered invalid, and are either eliminated or revised. Correlational procedures may also be employed for this purpose. For example, the biserial correlation between "pass-fail" on each item and total test score can be computed. Only those items yielding significant item-test correlations would be retained. A test whose items were selected by this method can be said to show internal consistency, since each item differentiates in the same direction as the entire test.

Another application of the criterion of internal consistency involves the correlation of subtest scores with total score. Many intelligence tests, for instance, consist of separately administered subtests (such as vocabulary, arithmetic, picture completion, etc.) whose scores are combined in finding the total test score. In the construction of such tests, the scores on each subtest are often correlated with total score and any subtest whose correlation with total score is too low is eliminated. The correlations of the remaining subtests with total score are then reported as evidence of the internal consistency of the entire instrument.

It is apparent that internal consistency correlations, whether based on items or subtests, are essentially measures of homogeneity. Because it helps to characterize the behavior domain or trait sampled by the test, the degree of homogeneity of a test has some relevance to its construct validity. Nevertheless, the contribution of internal consistency data to test validation is very limited. In the absence of data external to the test itself, little can be learned about what a test measures.
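The item-analysis procedure just described can be sketched as follows. The item responses are simulated, and the .20 retention threshold is arbitrary, chosen only to make the flagging rule concrete.

# Internal-consistency item analysis: correlate pass-fail on each item
# with total test score, as described above. Response data are simulated.
import numpy as np

rng = np.random.default_rng(1)
items = (rng.random((100, 10)) < .6).astype(float)  # 100 examinees, 10 items
total = items.sum(axis=1)                           # total score on the test itself

for i in range(items.shape[1]):
    # Point-biserial correlation between the dichotomous item and total score.
    r = np.corrcoef(items[:, i], total)[0, 1]
    flag = "" if r > .20 else "  <- candidate for elimination or revision"
    print(f"item {i + 1:2d}: r = {r: .2f}{flag}")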
EFFECT OF EXPERIMENTAL VARIABLES ON TEST SCORES. A further source of data for construct validation is provided by experiments on the effect of selected variables on test scores. In checking the validity of a criterion-referenced test for use in an individualized instructional program, for example, one approach is through a comparison of pretest and posttest scores. The rationale of such a test calls for low scores on the pretest, administered before the relevant instruction, and high scores on the posttest. This relationship can also be checked for individual items in the test (Popham, 1971). Ideally, the largest proportion of examinees should fail an item on the pretest and pass it on the posttest. Items that are commonly failed on both tests are too difficult, and those passed on both tests too easy, for the purposes of such a test. If a sizeable proportion of examinees pass an item on the pretest and fail it on the posttest, there is obviously something wrong with the item, or the instruction, or both.
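The item-level check described above can be sketched as follows. The response records are simulated, and the .10 flagging threshold is arbitrary, chosen only for illustration.

# Pretest-posttest item analysis for a criterion-referenced test: tabulate,
# for each item, the proportions who fail it before instruction and pass it
# after, and the reverse pattern. Matrices are simulated (1 = pass).
import numpy as np

rng = np.random.default_rng(2)
pre  = (rng.random((50, 5)) < .25).astype(int)   # before instruction
post = (rng.random((50, 5)) < .85).astype(int)   # after instruction

fail_then_pass = ((pre == 0) & (post == 1)).mean(axis=0)  # the ideal pattern
pass_then_fail = ((pre == 1) & (post == 0)).mean(axis=0)  # the troubling pattern

for i, (good, bad) in enumerate(zip(fail_then_pass, pass_then_fail), start=1):
    note = "  <- item, instruction, or both need review" if bad > .10 else ""
    print(f"item {i}: fail->pass {good:.2f}, pass->fail {bad:.2f}{note}")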
A test designed to measure anxiety-proneness can be administered to subjects who are subsequently put through a situation designed to arouse anxiety, such as taking an examination under distracting and stressful conditions. The initial anxiety test scores can then be correlated with physiological and other indices of anxiety expression during and after the examination. A different hypothesis regarding an anxiety test could be evaluated by administering the test before and after an anxiety-arousing experience and seeing whether test scores rise significantly on the retest. Positive findings from such an experiment would indicate that the test scores reflect current anxiety level. In a similar way, experiments can be designed to test any other hypothesis regarding the trait measured by a given test.

CONVERGENT AND DISCRIMINANT VALIDATION. In a thoughtful analysis of construct validation, D. T. Campbell (1960) points out that in order to demonstrate construct validity we must show not only that a test correlates highly with other variables with which it should theoretically correlate, but also that it does not correlate significantly with variables from which it should differ. In an earlier article, Campbell and Fiske (1959) described the former process as convergent validation and the latter as discriminant validation. Correlation of a mechanical aptitude test with subsequent grades in a shop course would be an example of convergent validation. For the same test, discriminant validity would be illustrated by a low and insignificant correlation with scores on a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure mechanical aptitude.

It will be recalled that the requirement of low correlation with irrelevant variables was discussed in connection with supplementary and precautionary procedures followed in content validation. Discriminant validation is also especially relevant to the validation of personality tests, in which irrelevant variables may affect scores in a variety of ways.

Campbell and Fiske (1959) proposed a systematic experimental design for the dual approach of convergent and discriminant validation, which they called the multitrait-multimethod matrix. Essentially, this procedure requires the assessment of two or more traits by two or more methods. A hypothetical example provided by Campbell and Fiske will serve to illustrate the procedure. Table 12 shows all possible correlations among the scores obtained when three traits are each measured by three methods. The three traits could represent three personality characteristics, such as (A) dominance, (B) sociability, and (C) achievement motivation. The three methods could be (1) a self-report inventory, (2) a projective technique, and (3) associates' ratings. Thus, A1 would indicate dominance scores on the self-report inventory, A2 dominance scores on the projective test, C3 associates' ratings on achievement motivation, and so forth.

TABLE 12
A Hypothetical Multitrait-Multimethod Matrix
(From Campbell & Fiske, 1959, p. 82.) [Matrix entries omitted.]
Note: Letters A, B, C refer to traits, subscripts 1, 2, 3 to methods. Validity coefficients (monotrait-heteromethod) are the three diagonal sets of boldface numbers; reliability coefficients (monotrait-monomethod) are the numbers in parentheses along the principal diagonal. Solid triangles enclose heterotrait-monomethod correlations; broken triangles enclose heterotrait-heteromethod correlations.

The hypothetical correlations given in Table 12 include reliability coefficients (in parentheses, along the principal diagonal) and validity coefficients (in boldface, along three shorter diagonals). In these validity coefficients, the scores obtained for the same trait by different methods are correlated; each measure is thus being checked against other, independent measures of the same trait, as in the familiar validation procedure. The table also includes correlations between different traits measured by the same method (in solid triangles) and correlations between different traits measured by different methods (in broken triangles). For satisfactory construct validity, the validity coefficients should obviously be higher than the correlations between different traits measured by different methods; they should also be higher than the correlations between different traits measured by the same method. For example, the correlation between dominance scores from a self-report inventory and dominance scores from a projective test should be higher than the correlation between dominance and sociability scores from a self-report inventory. If the latter correlation, representing common method variance, were high, it might indicate, for example, that a person's scores on this inventory are unduly affected by some irrelevant common factor such as ability to understand the questions or desire to make oneself appear in a favorable light on all traits.

Fiske (1973) has added still another set of correlations that should be checked, especially in the construct validation of personality tests. These correlations involve the same trait measured by the same method, but with a different test. For example, two investigators may each prepare a self-report inventory designed to assess endurance. Yet the endurance scores obtained with the two inventories may show quite different patterns of correlations with measures of other personality traits. Under these conditions, it cannot be concluded that both inventories measure the same personality construct of endurance.

It might be noted that within the framework of the multitrait-multimethod matrix, reliability represents agreement between two measures of the same trait obtained through maximally similar methods, such as parallel forms of the same test; validity represents agreement between two measures of the same trait obtained by maximally different methods, such as test scores and supervisor's ratings. Since similarity and difference of methods are matters of degree, theoretically reliability and validity can be regarded as falling along a single continuum. Ordinarily, however, the techniques actually employed to measure reliability and validity correspond to easily identifiable regions of this continuum.

We have considered several ways of asking, "How valid is this test?" To point up the distinctive features of the different types of validity, let us apply each in turn to a test consisting of 50 assorted arithmetic problems. Four ways in which this test might be employed, together with the type of validation procedure appropriate to each, are illustrated in Table 13. This example highlights the fact that the choice of validation procedure depends on the use to be made of the test scores. The same test, when employed for different purposes, should be validated in different ways. If an achievement test is used to predict subsequent performance at a higher educational level, as when selecting high school students for college admission, it needs to be evaluated against the criterion of subsequent college performance rather than in terms of its content validity.

TABLE 13
Validation of a Single Arithmetic Test for Different Purposes

Use of Test                                Illustrative Question                        Type of Validity
Achievement test in elementary             How much has Dick learned                    Content
  school arithmetic                          in the past?
Aptitude test to predict performance       How well will Jim learn                      Criterion-related: predictive
  in high school mathematics                 in the future?
Technique for diagnosing learning          Does Bill's performance show                 Criterion-related: concurrent
  disabilities                               specific disabilities?
Measure of logical reasoning               How can we describe Henry's                  Construct
                                             psychological functioning?

The examples given in Table 13 focus on the differences among the various types of validation procedures. Further consideration of these procedures, however, shows that content, criterion-related, and construct validity do not correspond to distinct or logically coordinate categories. On the contrary, construct validity is a comprehensive concept, which includes the other types. All the specific techniques for establishing content and criterion-related validity, discussed in earlier sections of this chapter, could have been listed again under construct validity. Comparing the test performance of contrasted groups, such as neurotics and normals, is one way of checking the construct validity of a test designed to measure emotional adjustment, anxiety, or other postulated traits. Comparing the test scores of institutionalized mental retardates with those of normal schoolchildren is one way to investigate the construct validity of an intelligence test. The correlations of a mechanical aptitude test with performance in shop courses and in a wide variety of jobs contribute to our understanding of the construct measured by the test. Validity against various practical criteria is commonly reported in test manuals to aid the potential user in understanding what a test measures. Although he may not be directly concerned with the prediction of any of the specific criteria employed, by examining such criteria the test user is able to build up a concept of the behavior domain sampled by the test.

Content validity likewise enters into both the construction and the subsequent evaluation of all tests. In assembling items for any new test, the test constructor is guided by hypotheses regarding the relations between the type of content he chooses and the behavior he wishes to measure. All the techniques of criterion-related validation, as well as the other techniques discussed under construct validation, represent ways of testing such hypotheses. As for the test user, he too relies in part on content validity in evaluating any test. For example, he may check the vocabulary in an emotional adjustment inventory to determine whether some of the words are too difficult for the persons he plans to test; he may conclude that the scores on a particular test depend too much on speed for his purposes; or he may notice that an intelligence test developed twenty years ago contains many obsolescent items unsuitable for use today. All these observations about content are relevant to the construct validity of a test. In fact, there is no information provided by any validation procedure that is not relevant to construct validity.

The term construct validity was officially introduced into the psychometrist's lexicon in 1954 in the Technical Recommendations for Psychological Tests and Diagnostic Techniques, which constituted the first edition of the current APA test Standards (1974).
procedures subsumed under construct validity were not new at the time, the discussions of construct validation that followed served to make the implications of these procedures more explicit and to provide a systematic rationale for their use. Construct validation has focused attention on the role of psychological theory in test construction and on the need to formulate hypotheses that can be proved or disproved in the validation process. It is particularly appropriate in the evaluation of tests for use in research.

In practical contexts, construct validation is suitable for investigating the validity of the criterion measures used in traditional criterion-related test validation (see, e.g., James, 1973). Through an analysis of the correlations of different criterion measures with each other and with other relevant variables, and through factorial analyses of such data, one can learn more about the meaning of a particular criterion. In some instances, the results of such a study may lead to modification or replacement of the criterion chosen to validate a test. Under any circumstances, the results will enrich the interpretation of the test validation study.

Another practical application of construct validation is in the evaluation of tests in situations that do not permit acceptable criterion-related validation studies, as in the local validation of some personnel tests for industrial use. The difficulties encountered in these situations were discussed earlier in this chapter, in connection with synthetic validity. Construct validation offers another alternative approach that could be followed in evaluating the appropriateness of published tests for a particular job. Like synthetic validation, this approach requires a systematic job analysis, followed by a description of worker qualifications expressed in terms of relevant behavioral constructs. If, now, the test has been subjected to sufficient research prior to publication, the data cited in the manual should permit a specification of the principal constructs measured by the test. This information could be used directly in assessing the relevance of the test to the required job functions, if the correspondence of constructs is clear enough; or it could serve as a basis for computing a J-coefficient or some other quantitative index of synthetic validity.

Construct validation has also stimulated the search for novel ways of gathering validity data. Although the principal techniques employed in investigating construct validity have long been familiar, the field of operation has been expanded to admit a wider variety of procedures. This very multiplicity of data-gathering techniques, however, presents certain hazards. It is possible for a test constructor to try a large number of different validation procedures, a few of which will yield positive results by chance. If these confirmatory results were then to be reported without mention of all the validity probes that yielded negative results, a very misleading impression about the validity of a test could be created. Another possible danger in the application of construct validation is that it may open the way for subjective, unverified assertions about test validity. Since construct validity is such a broad and loosely defined concept, it has been widely misunderstood. Some textbook writers and test constructors seem to perceive it as content validity expressed in terms of psychological trait names. Hence, they present as construct validity purely subjective accounts of what they believe (or hope) the test measures.

A further source of possible confusion arises from a statement that construct validation "is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not 'operationally defined'" (Cronbach & Meehl, 1955, p. 282). Appearing in the first detailed published analysis of the concept of construct validity, this statement was often incorrectly accepted as justifying a claim for construct validity in the absence of data. That the authors of the statement did not intend such an interpretation is illustrated by their own insistence, in the same article, that "unless the network makes contact with observations . . . construct validation cannot be claimed" (p. 291). In the same connection, they criticize tests for which "a finespun network of rationalizations has been offered as if it were validation" (p. 291). Actually, the theoretical construct, trait, or behavior domain measured by a particular test can be adequately defined only in the light of data gathered in the process of validating that test. Such a definition would take into account the variables with which the test correlated significantly, as well as the conditions found to affect its scores and the groups that differ significantly in such scores. These procedures are entirely in accord with the positive contributions made by the concept of construct validity. It is only through the empirical investigation of the relationships of test scores to other external data that we can discover what a test measures.
CHAPTER 7

Validity: Measurement and Interpretation

CHAPTER 6 was concerned with different concepts of validity and their appropriateness for various testing functions; this chapter deals with quantitative expressions of validity and their interpretation. The test user is concerned with validity at either or both of two stages. First, when considering the suitability of a test for his purposes, he examines available validity data reported in the test manual or other published sources. Through such information, he arrives at a tentative concept of what psychological functions the test actually measures, and he judges the relevance of such functions to his proposed use of the test. In effect, when a test user relies on published validation data, he is dealing with construct validity, regardless of the specific procedures used in gathering the data. As we have seen in Chapter 6, the criteria employed in published studies cannot be assumed to be identical with those the test user wants to predict. Jobs bearing the same title in two different companies are rarely identical. Two courses in freshman English taught in different colleges may be quite dissimilar.

Because of the specificity of each criterion, test users are usually advised to check the validity of any chosen test against local criteria whenever possible. Although published data may strongly suggest that a given test should have high validity in a particular situation, direct corroboration is always desirable. The determination of validity against specific local criteria represents the second stage in the test user's evaluation of validity. The techniques to be discussed in this chapter are especially relevant to the analysis of validity data obtained by the test user himself. Most of them are also useful, however, in understanding and interpreting the validity data reported in test manuals.

MEASUREMENT OF RELATIONSHIP. A validity coefficient is a correlation between test score and criterion measure. Because it provides a single numerical index of test validity, it is commonly used in test manuals to report the validity of a test against each criterion for which data are available. The data used in computing any validity coefficient can also be expressed in the form of an expectancy table or expectancy chart, illustrated in Chapter 4. In fact, such tables and charts provide a convenient way to show what a validity coefficient means for the person tested. It will be recalled that expectancy charts give the probability that an individual who obtains a certain score on the test will attain a specified level of criterion performance. For example, with Table 6 (Ch. 4, p. 101), if we know a student's score on the DAT Verbal Reasoning test, we can look up the chances that he will earn a particular grade in a high school course. The same data yield a validity coefficient of .66.

When both test and criterion variables are continuous, as in this example, the familiar Pearson Product-Moment Correlation Coefficient is applicable. Other types of correlation coefficients can be computed when the data are expressed in different forms, as when a two-fold pass-fail criterion is employed (e.g., Fig. 7, Ch. 4). The specific procedures for computing these different kinds of correlations can be found in any standard statistics text.
CONDITIONS AFFECTING VALIDITY COEFFICIENTS. As in the case of reliability, it is essential to specify the nature of the group on which a validity coefficient is found. The same test may measure different functions when given to individuals who differ in age, sex, educational level, occupation, or any other relevant characteristic. Persons with different experiential backgrounds, for example, may utilize different work methods to solve the same test problem. Consequently, a test could have high validity in predicting a particular criterion in one population, and little or no validity in another. Or it might be a valid measure of different functions in the two populations. Thus, unless the validation sample is representative of the population on which the test is to be used, validity should be redetermined on a more appropriate sample.

The question of sample heterogeneity is relevant to the measurement of validity, as it is to the measurement of reliability, since both characteristics are commonly reported in terms of correlation coefficients. It will be recalled that, other things being equal, the wider the range of scores, the higher will be the correlation. This fact should be kept in mind when interpreting the validity coefficients given in test manuals.
A special difficulty encountered in many validation samples arises from preselection. For example, a new test that is being validated for job selection may be administered to a group of newly hired employees on whom criterion measures of job performance will eventually be available. It is likely, however, that such employees represent a superior selection of all those who applied for the job. Hence, the range of such a group in both test scores and criterion measures will be curtailed at the lower end of the distribution. The effect of such preselection will therefore be to lower the validity coefficient. In the subsequent use of the test, when it is administered to all applicants for selection purposes, the validity can be expected to be somewhat higher.

Validity coefficients may also change over time because of changing selection standards. An example is provided by a comparison of validity coefficients computed over a 30-year interval with Yale students (Burnham, 1965). Correlations were found between a predictive index based on College Entrance Examination Board tests and high school records, on the one hand, and average freshman grades, on the other. This correlation dropped from .71 to .52 over the 30 years. An examination of the bivariate distributions clearly reveals the reason for this drop. Because of higher admission standards, the later class was more homogeneous than the earlier class in both predictor and criterion performance. Consequently, the correlation was lower in the later group, although the accuracy with which individuals' grades were predicted showed little change. In other words, the observed drop in correlation did not indicate that the predictors were less valid than they had been 30 years earlier. Had the differences in group homogeneity been ignored, it might have been wrongly concluded that this was the case.
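The effect of preselection on the validity coefficient is readily demonstrated by simulation. In the sketch below (all values simulated, not real data), a large sample with a true validity of .70 is curtailed at the median of the test-score distribution and the correlation recomputed:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_r = 0.70

test = rng.standard_normal(n)
criterion = true_r * test + np.sqrt(1 - true_r**2) * rng.standard_normal(n)

# Full, unrestricted applicant sample
r_full = np.corrcoef(test, criterion)[0, 1]

# Preselected sample: only the upper half of the test-score range survives
kept = test > np.median(test)
r_restricted = np.corrcoef(test[kept], criterion[kept])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))  # the restricted r is clearly lower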
For the proper interpretation of a validity coefficient, attention should also be given to the form of the relationship between test and criterion. The computation of a Pearson correlation coefficient assumes that the relationship is linear and uniform throughout the range. There is evidence that in certain situations, however, these conditions may not be met (Fisher, 1959; Kahneman & Ghiselli, 1962). Thus, a particular job may require a minimum level of reading comprehension, to enable employees to read instruction manuals, labels, and the like. Once this minimum is exceeded, however, further increments in reading ability may be unrelated to degree of job success. This would be an example of a nonlinear relation between test and job performance. An examination of the bivariate distribution or scatter diagram obtained by plotting reading comprehension scores against criterion measures would show a rise in job performance up to the minimal required reading ability and a leveling off beyond that point. Hence, the entries would cluster around a curve rather than a straight line.

In other situations, the line of best fit may be a straight line, but the individual entries may deviate farther around this line at the upper than at the lower end of the scale. Suppose that performance on a scholastic aptitude test is a necessary but not a sufficient condition for successful achievement in a course. That is, the low-scoring students will perform poorly in the course; but among the high-scoring students, some will perform well in the course and others will perform poorly because of low motivation. In this situation, there will be wider variability of criterion performance among the high-scoring than among the low-scoring students. This condition in a bivariate distribution is known as heteroscedasticity. The Pearson correlation assumes homoscedasticity, or equal variability throughout the range of the bivariate distribution. In the present example, the bivariate distribution would be fan-shaped: wide at the upper end and narrow at the lower end. An examination of the bivariate distribution itself will usually give a good indication of the nature of the relationship between test and criterion. Expectancy tables and expectancy charts also correctly reveal the relative effectiveness of the test at different levels.

MAGNITUDE OF A VALIDITY COEFFICIENT. How high should a validity coefficient be? No general answer to this question is possible, since the interpretation of a validity coefficient must take into account a number of concomitant circumstances. The obtained correlation, of course, should be high enough to be statistically significant at some acceptable level, such as the .01 or .05 levels discussed in Chapter 5. In other words, before drawing any conclusions about the validity of a test, we should be reasonably certain that the obtained validity coefficient could not have arisen through chance fluctuations of sampling from a true correlation of zero.

Having established a significant correlation between test scores and criterion, however, we need to evaluate the size of the correlation in the light of the uses to be made of the test. If we wish to predict an individual's exact criterion score, such as the grade-point average a student will receive in college, the validity coefficient may be interpreted in terms of the standard error of estimate, which is analogous to the error of measurement discussed in connection with reliability. It will be recalled that the error of measurement indicates the margin of error to be expected in an individual's score as a result of the unreliability of the test. Similarly, the error of estimate shows the margin of error to be expected in the individual's predicted criterion score, as a result of the imperfect validity of the test.

The error of estimate is found by the following formula:
σ_est = σ_y √(1 − r²_xy)

in which r²_xy is the square of the validity coefficient and σ_y is the standard deviation of the criterion scores. It will be noted that if the validity were perfect (r_xy = 1.00), the error of estimate would be zero. On the other hand, with a test having zero validity, the error of estimate is as large as the standard deviation of the criterion distribution (σ_est = σ_y √(1 − 0) = σ_y). Under these conditions, the prediction is no better than a guess; and the range of prediction error is as wide as the entire distribution of criterion scores. Between these two extremes are to be found the errors of estimate corresponding to tests of varying validity.

Reference to the formula for σ_est will show that the term √(1 − r²_xy) serves to indicate the size of the error relative to the error that would result from a mere guess, i.e., with zero validity. In other words, if √(1 − r²_xy) is equal to 1.00, the error of estimate is as large as it would be if we were to guess the subject's score. The predictive improvement attributable to the use of the test would thus be nil. If the validity coefficient is .80, then √(1 − r²_xy) is equal to .60, and the error is 60 percent as large as it would be by chance. To put it differently, the use of such a test enables us to predict the individual's criterion performance with a margin of error that is 40 percent smaller than it would be if we were to guess.
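A short computational sketch of this formula:

import math

def error_of_estimate(validity, sd_criterion):
    # sigma_est = sigma_y * sqrt(1 - r^2)
    return sd_criterion * math.sqrt(1 - validity**2)

# With validity .80 and a criterion SD of 1.00, the error of estimate is .60,
# i.e., 60 percent as large as the error of a mere guess.
print(round(error_of_estimate(0.80, 1.00), 2))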
It would thus appear that even with a validity of .80, which is unusually high, the error of predicted scores is considerable. If the primary function of psychological tests were to predict each individual's exact position in the criterion distribution, the outlook would be quite discouraging. When examined in the light of the error of estimate, most tests do not appear very efficient. In most testing situations, however, it is not necessary to predict the specific criterion performance of individual cases, but rather to determine which individuals will exceed a certain minimum standard of performance, or cutoff point, in the criterion. What are the chances that Mary Greene will graduate from medical school, that Tom Higgins will pass a course in calculus, or that Beverly Bruce will succeed as an astronaut? Which applicants are likely to be satisfactory clerks, salesmen, or machine operators? Such information is useful not only for group selection but also for individual career planning. For example, it is advantageous for a student to know that he has a good chance of passing all courses in law school, even if we are unable to estimate with certainty whether his grade average will be 74 or 81.

A test may appreciably improve predictive efficiency if it shows any significant correlation with the criterion, however low. Under certain circumstances, even validities as low as .20 or .30 may justify inclusion of the test in a selection program. For many testing purposes, evaluation of tests in terms of the error of estimate is unrealistically stringent. Consideration must be given to other ways of evaluating the contribution of a test, which take into account the types of decisions to be made from the scores. Some of these procedures will be illustrated in the following section.

BASIC APPROACH. Let us suppose that 100 applicants have been given an aptitude test and followed up until each could be evaluated for success on a certain job. Figure 17 shows the bivariate distribution of test scores and measures of job success for the 100 subjects. The correlation between these two variables is slightly below .70. The minimum acceptable job performance, or criterion cutoff point, is indicated in the diagram by a heavy horizontal line. The 40 cases falling below this line would represent job failures; the 60 above the line, job successes. If all 100 applicants are hired, therefore, 60 percent will succeed on the job. Similarly, if a smaller number were hired at random, without reference to test scores, the proportion of successes would probably be close to 60 percent.

Suppose, however, that the test scores are used to select the 45 most promising applicants out of the 100 (selection ratio = .45). In such a case, the 45 individuals falling to the right of the heavy vertical line would be chosen. Within this group of 45, it can be seen that there are 7 job failures, or false acceptances, falling below the heavy horizontal line, and 38 job successes. Hence, the percentage of job successes is now 84 rather than 60 (i.e., 38/45 = .84). This increase is attributable to the use of the test as a screening instrument. It will be noted that errors in predicted criterion score that do not affect the decision can be ignored. Only those prediction errors that cross the cutoff line and hence place the individual in the wrong category will reduce the selective effectiveness of the test.
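The bookkeeping of this example can be written out directly from the four cell counts of Figure 17:

# Cell counts from Figure 17 (100 applicants in all)
valid_acceptances = 38
false_acceptances = 7
false_rejections = 22
valid_rejections = 33

hired = valid_acceptances + false_acceptances             # 45: selection ratio .45
base_rate = (valid_acceptances + false_rejections) / 100  # .60 succeed without the test
hit_rate = valid_acceptances / hired                      # .84 succeed among those hired

print(base_rate, round(hit_rate, 2))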
For a complete evaluation of the effectiveness of the test as a screening instrument, another category of cases in Figure 17 must also be examined. This is the category of false rejections, comprising the 22 persons who score below the cutoff point on the test but above the criterion cutoff. From these data we would estimate that 22 percent of the total applicant sample are potential job successes who will be lost if the test is used as a screening device with the present cutoff point. These false rejects in a personnel selection situation correspond to the false positives in clinical evaluations. The latter term has been adopted from medical practice, in which a test for a pathological condition is reported as positive if the condition is present and negative if the patient is normal. A false positive thus refers to a case in which the test erroneously indicates the presence of a pathological condition, as when brain damage is indicated in an individual who is actually normal. This terminology is likely to be
confusing unless we remember that in clinical practice a positive result on a test denotes pathology and an unfavorable diagnosis, whereas in personnel selection a positive result conventionally refers to a favorable prediction regarding job performance, academic achievement, and the like.

FIG. 17. Increase in the Proportion of "Successes" Resulting from the Use of a Selection Test. (Scatter diagram of job success plotted against test score, with a heavy horizontal line marking the criterion cutoff that separates job successes from job failures.)

In setting a cutoff score on a test, attention should be given to the percentage of false rejects (or false positives) as well as to the percentages of successes and failures within the selected group. In certain situations, the cutoff point should be set sufficiently high to exclude all but a few possible failures. This would be the case when the job is of such a nature that a poorly qualified worker could cause serious loss or damage. An example would be a commercial airline pilot. Under other circumstances, it may be more important to admit as many qualified persons as possible, at the risk of including more failures. In the latter case, the number of false rejects can be reduced by the choice of a lower cutoff score. Other factors that normally determine the position of the cutoff score include the available personnel supply, the number of job openings, and the urgency or speed with which the openings must be filled.

In many personnel decisions, the selection ratio is determined by the practical demands of the situation. Because of supply and demand in filling job openings, for example, it may be necessary to hire the top 40 percent of applicants in one case and the top 75 percent in another. When the selection ratio is not externally imposed, the cutting score on a test can be set at that point giving the maximum differentiation between criterion groups. This can be done roughly by comparing the distribution of test scores in the two criterion groups. More precise mathematical procedures for setting optimal cutting scores have also been worked out (Darlington & Stauffer, 1966; Guttman & Raju, 1965; Rorer, Hoffman, La Forge, & Hsieh, 1966). These procedures make it possible to take into account other relevant parameters, such as the relative seriousness of false rejections and false acceptances.

In the terminology of decision theory, the example given in Figure 17 illustrates a simple strategy, or plan for deciding which applicants to accept and which to reject. In more general terms, a strategy is a technique for utilizing information in order to reach a decision about individuals. In this case, the strategy was to accept the 45 persons with the highest test scores. The increase in percentage of successful employees from 60 to 84 could be used as a basis for estimating the net benefit resulting from the use of the test.

Statistical decision theory was developed by Wald (1950) with special reference to the decisions required in the inspection and quality control of industrial products. Many of its implications for the construction and interpretation of psychological tests have been systematically worked out by Cronbach and Gleser (1965). Essentially, decision theory is an attempt to put the decision-making process into mathematical form, so that available information may be used to arrive at the most effective decision under specified circumstances. The mathematical procedures employed in decision theory are often quite complex, and few are in a form permitting their immediate application to practical testing problems. Some of the basic concepts of decision theory, however, are proving helpful in the reformulation and clarification of certain questions about tests. A few of these ideas were introduced into testing before the formal development of statistical decision theory and were later recognized as fitting into that framework.

PREDICTION OF OUTCOMES. A precursor of decision theory in psychological testing is to be found in the Taylor-Russell tables (1939), which permit a determination of the net gain in selection accuracy attributable to the use of the test. The information required includes the validity
coefficient of the test, the proportion of applicants who must be accepted (selection ratio), and the proportion of successful applicants selected without the use of the test (base rate). A change in any of these three factors can alter the predictive efficiency of the test.

For purposes of illustration, one of the Taylor-Russell tables has been reproduced in Table 14. This table is designed for use when the base rate, or percentage of successful applicants selected prior to the use of the test, is .60. Other tables are provided by Taylor and Russell for other base rates. Across the top of the table are given different values of the selection ratio, and along the side are the test validities. The entries in the body of the table indicate the proportion of successful persons selected after the use of the test. Thus, the difference between .60 and any one table entry shows the increase in proportion of successful selections attributable to the test.

TABLE 14
Proportion of "Successes" Expected through the Use of Test of Given
Validity and Given Selection Ratio, for Base Rate .60
(From Taylor and Russell, 1939, p. 576)

Test                           Selection Ratio
Validity   .05   .10   .20   .30   .40   .50   .60   .70   .80   .90   .95
  .00      .60   .60   .60   .60   .60   .60   .60   .60   .60   .60   .60
  .05      .64   .63   .63   .62   .62   .62   .61   .61   .61   .60   .60
  .10      .68   .67   .65   .64   .64   .63   .63   .62   .61   .61   .60
  .15      .71   .70   .68   .67   .66   .65   .64   .63   .62   .61   .61
  .20      .75   .73   .71   .69   .67   .66   .65   .64   .63   .62   .61
  .25      .78   .76   .73   .71   .69   .68   .66   .65   .63   .62   .61
  .30      .82   .79   .76   .73   .71   .69   .68   .66   .64   .62   .61
  .35      .85   .82   .78   .75   .73   .71   .69   .67   .65   .63   .62
  .40      .88   .85   .81   .78   .75   .73   .70   .68   .66   .63   .62
  .45      .90   .87   .83   .80   .77   .74   .72   .69   .66   .64   .62
  .50      .93   .90   .86   .82   .79   .76   .73   .70   .67   .64   .62
  .55      .95   .92   .88   .84   .81   .78   .75   .71   .68   .64   .62
  .60      .96   .94   .90   .87   .83   .80   .76   .73   .69   .65   .63
  .65      .98   .96   .92   .89   .85   .82   .78   .74   .70   .65   .63
  .70      .99   .97   .94   .91   .87   .84   .80   .75   .71   .66   .63
  .75      .99   .99   .96   .93   .90   .86   .81   .77   .71   .66   .63
  .80     1.00   .99   .98   .95   .92   .88   .83   .78   .72   .66   .63
  .85     1.00  1.00   .99   .97   .95   .91   .86   .80   .73   .66   .63
  .90     1.00  1.00  1.00   .99   .97   .94   .88   .82   .74   .67   .63
  .95     1.00  1.00  1.00  1.00   .99   .97   .92   .84   .75   .67   .63
 1.00     1.00  1.00  1.00  1.00  1.00  1.00  1.00   .86   .75   .67   .63

Obviously, if the selection ratio were 100 percent, that is, if all applicants had to be accepted, no test, however valid, could improve the selection process. Reference to Table 14 shows that, when as many as 95 percent of applicants must be admitted, even a test with perfect validity (r = 1.00) would raise the proportion of successful persons by only 3 percent (.60 to .63). On the other hand, when only 5 percent of applicants need to be chosen, a test with a validity coefficient of only .30 can raise the percentage of successful applicants selected from 60 to 82. The rise from 60 to 82 represents the incremental validity of the test (Sechrest, 1963), or the increase in predictive validity attributable to the test. It indicates the contribution the test makes to the selection of individuals who will meet the minimum standards in criterion performance. In applying the Taylor-Russell tables, of course, test validity should be computed on the same sort of group used to estimate percentage of prior successes. In other words, the contribution of the test is not evaluated against chance success unless applicants were previously selected by chance, a most unlikely circumstance. If applicants had been selected on the basis of previous job history, letters of recommendation, and interviews, the contribution of the test should be evaluated on the basis of what the test adds to these previous selection procedures.

The incremental validity resulting from the use of a test depends not only on the selection ratio but also on the base rate. In the previously illustrated job selection situation, the base rate refers to the proportion of successful employees prior to the introduction of the test for selection purposes. Table 14 shows the anticipated outcomes when the base rate is .60. For other base rates, we need to consult the other appropriate tables in the cited reference (Taylor & Russell, 1939). Let us consider an example in which test validity is .40 and the selection ratio is 70 percent. Under these conditions, what would be the contribution or incremental validity of the test if we begin with a base rate of 50 percent? And what would be the contribution if we begin with more extreme base rates of 10 and 90 percent? Reference to the appropriate Taylor-Russell tables for these base rates shows that the percentage of successful employees would rise from 50 to 75 in the first case; from 10 to 21 in the second; and from 90 to 99 in the third. Thus, the improvement in percentage of successful employees attributable to the use of the test is 25 when the base rate was 50, but only 11 and 9 when the base rates were more extreme.
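Treated as data, Table 14 lends itself to the same computation mechanically. A minimal sketch, transcribing two rows of the table:

# Two rows of Table 14 (base rate .60), indexed by selection ratio
SELECTION_RATIOS = [.05, .10, .20, .30, .40, .50, .60, .70, .80, .90, .95]
TABLE_14 = {
    0.30: [.82, .79, .76, .73, .71, .69, .68, .66, .64, .62, .61],
    1.00: [1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, .86, .75, .67, .63],
}

def incremental_validity(validity, selection_ratio, base_rate=0.60):
    success = TABLE_14[validity][SELECTION_RATIOS.index(selection_ratio)]
    return round(success - base_rate, 2)

print(incremental_validity(0.30, 0.05))  # .22: from .60 to .82
print(incremental_validity(1.00, 0.95))  # .03: from .60 to .63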
The implications of extreme base rates are of special interest in clinical psychology, where the base rate refers to the frequency of the pathological condition to be diagnosed in the population tested (Buchwald,
1965; Cureton, 1957a; Meehl & Rosen, 1955; J. S. Wiggins, 1973). For example, if 5 percent of the intake population of a clinic has organic brain damage, then 5 percent is the base rate of brain damage in this population. Although the introduction of any valid test will improve predictive or diagnostic accuracy, the improvement is greatest when the base rates are closest to 50 percent. With the extreme base rates found with rare pathological conditions, however, the improvement may be negligible. Under these conditions, the use of a test may prove to be unjustified when the cost of its administration and scoring is taken into account. In a clinical situation, this cost would include the time of professional personnel that might otherwise be spent on the treatment of additional cases (Buchwald, 1965). The number of false positives, or normal individuals incorrectly classified as pathological, would of course increase this overall cost in a clinical situation.

When the seriousness of a rare condition makes its diagnosis urgent, tests of moderate validity may be employed in an early stage of sequential decisions. For example, all cases might first be screened with an easily administered test of moderate validity. If the cutoff score is set high enough (high scores being favorable), there will be few false negatives but many false positives, or normals diagnosed as pathological. The latter can then be detected through a more intensive individual examination given to all cases diagnosed as positive by the test. This solution would be appropriate, for instance, when available facilities make the intensive individual examination of all cases impracticable.

RELATION OF VALIDITY TO MEAN OUTPUT LEVEL. In many practical situations, what is wanted is an estimate of the effect of the selection test, not on percentage of persons exceeding the minimum performance, but on overall output of the selected persons. How does the actual level of job proficiency or criterion achievement of the workers hired on the basis of the test compare with that of the total applicant sample that would have been hired without the test? Following the work of Taylor and Russell, several investigators addressed themselves to this question (Brogden, 1946; Brown & Ghiselli, 1953; Jarrett, 1948; Richardson, 1944). Brogden (1946) first demonstrated that the expected increase in output is directly proportional to the validity of the test. Thus, the improvement resulting from the use of a test of validity .50 is 50 percent as great as the improvement expected from a test of perfect validity.
The relation between test validity and expected rise in criterion achievement can be readily seen in Table 15.¹ Expressing criterion scores as standard scores with a mean of zero and an SD of 1.00, this table gives the expected mean criterion score of workers selected with a test of given validity and with a given selection ratio. In this context, the base output mean, corresponding to the performance of applicants selected without use of the test, is given in the column for zero validity. Using a test with zero validity is equivalent to using no test at all. To illustrate the use of the table, let us assume that the highest scoring 20 percent of the applicants are hired (selection ratio = .20) by means of a test whose validity coefficient is .50. Reference to Table 15 shows that the mean criterion performance of this group is .70 SD above the expected base mean of an untested sample. With the same 20 percent selection ratio and a perfect test (validity coefficient = 1.00), the mean criterion score of the accepted applicants would be 1.40, just twice what it would be with the test of validity .50. Similar direct linear relations will be found if other mean criterion performances are compared within any row of Table 15. For instance, with a selection ratio of 60 percent, a validity of .25 yields a mean criterion score of .16, while a validity of .50 yields a mean of .32. Again, doubling the validity doubles the output rise.

¹ A table including more values for both selection ratios and validity coefficients was prepared by Naylor and Shine (1965).

TABLE 15. Mean Standard Criterion Score of Accepted Cases, for Given Values of Test Validity and Selection Ratio (criterion scores in standard-score form; the column for zero validity gives the base output mean)
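Entries of this kind can be reproduced from the usual normal selection model: if test and criterion standard scores are assumed to be bivariate normal, the mean standard test score of those above the cutoff equals the normal ordinate at the cutoff divided by the selection ratio, and the expected mean standard criterion score is the validity coefficient times that quantity. The sketch below rests on that assumed model and recovers the values cited above:

from math import exp, pi, sqrt
from statistics import NormalDist

def mean_criterion_z(validity, selection_ratio):
    # Mean standard test score of the selected group under normality:
    # the normal ordinate at the cutoff divided by the selection ratio
    z_cut = NormalDist().inv_cdf(1 - selection_ratio)
    ordinate = exp(-z_cut**2 / 2) / sqrt(2 * pi)
    # The expected mean standard criterion score is proportional to validity
    return validity * ordinate / selection_ratio

print(round(mean_criterion_z(0.50, 0.20), 2))  # .70
print(round(mean_criterion_z(1.00, 0.20), 2))  # 1.40
print(round(mean_criterion_z(0.25, 0.60), 2))  # .16
print(round(mean_criterion_z(0.50, 0.60), 2))  # .32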
The evaluation of test validity in terms of either mean predicted output or proportion of persons exceeding a minimum criterion cutoff is obviously much more favorable than an evaluation based on the previously discussed error of estimate. The reason for the difference is that prediction errors that do not affect decisions are irrelevant to the selection situation. For example, if Smith and Jones are both superior workers and are both hired on the basis of the test, it does not matter if the test shows Smith to be better than Jones while in job performance Jones excels Smith.

THE ROLE OF VALUES IN DECISION THEORY. It is characteristic of decision theory that tests are evaluated in terms of their effectiveness in a specific situation. Such evaluation takes into account not only the validity of the test in predicting a particular criterion but also a number of other parameters, including base rate and selection ratio. Another important parameter is the relative utility of expected outcomes, the judged favorableness or unfavorableness of each outcome. The lack of adequate systems for assigning values to outcomes in terms of a uniform utility scale is one of the chief obstacles to the application of decision theory. In industrial decisions, a dollar-and-cents value can frequently be assigned to different outcomes. Even in such cases, however, certain outcomes pertaining to good will, public relations, and employee morale are difficult to assess in monetary terms. Educational decisions must take into account institutional goals, social values, and other relatively intangible factors. Individual decisions, as in counseling, must consider the individual's preferences and value system. It has been repeatedly pointed out, however, that decision theory did not introduce the problem of values into the decision process, but merely made it explicit. Value systems have always entered into decisions, although they were not heretofore clearly recognized or systematically handled.

In choosing a decision strategy, the goal is to maximize expected utilities across all outcomes. Reference to the schematic representation of a simple decision strategy in Figure 18 will help to clarify the procedure. This diagram shows the decision strategy illustrated in Figure 17, in which a single test is administered to a group of applicants and the decision to accept or reject an applicant is made on the basis of a cutoff score on the test. There are four possible outcomes, including valid and false acceptances and valid and false rejections. The probability of each outcome can be found from the number of persons in each of the four sections of Figure 17. Since there were 100 applicants in that example, these numbers divided by 100 give the probabilities of the four outcomes listed in Figure 18. The other data needed are the utilities of the different outcomes, expressed on a common scale. The expected overall utility of the strategy could then be found by multiplying the probability of each outcome by the utility of that outcome, adding these products for the four outcomes, and subtracting a value corresponding to the cost of testing.² This last term highlights the fact that a test of low validity is more likely to be retained if it is short, inexpensive, easily administered by relatively untrained personnel, and suitable for group administration. An individual test requiring a trained examiner or expensive equipment would need a higher validity to justify its use.

FIG. 18. A Simple Decision Strategy. (Administer test and apply cutoff score; outcomes and probabilities: Valid Acceptance .38, False Acceptance .07, Valid Rejection .33, False Rejection .22.)

² For a fictitious example illustrating all steps in these computations, see J. S. Wiggins (1973), pp. 257-274.
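The computation described for Figure 18 can be written out directly. The outcome probabilities below are those given in the figure; the utility values and the testing cost are hypothetical, since in practice they must be supplied by the decision maker:

# Probabilities from Figure 18; utilities (on a common scale) and cost are assumed
outcomes = {
    "valid acceptance": (0.38, 1.0),
    "false acceptance": (0.07, -1.0),
    "valid rejection": (0.33, 0.5),
    "false rejection": (0.22, -0.5),
}
cost_of_testing = 0.05  # hypothetical, on the same utility scale

expected_utility = sum(p * u for p, u in outcomes.values()) - cost_of_testing
print(round(expected_utility, 3))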
SEQUENTIAL STRATEGIES AND ADAPTIVE TREATMENTS. In some situations, the effectiveness of a test may be increased through the use of more complex decision strategies, which take still more parameters into account. Two examples will serve to illustrate these possibilities. First, tests may be used to make sequential rather than terminal decisions. With the simple decision strategy illustrated in Figures 17 and 18, all decisions to accept or reject are treated as terminal. Figure 19, on the other hand, shows a two-stage sequential decision. Test A could be a short and easily administered screening test. On the basis of performance on this test, individuals would be sorted into three categories, including those clearly accepted or rejected, as well as an intermediate "uncertain" group to be examined further with more intensive techniques, represented by Test B. On the basis of the second-stage testing, this group would be sorted into accepted and rejected categories.

Such sequential testing can also be employed within a single testing session, to maximize the effective use of testing time (DeWitt & Weiss, 1974; Linn, Rock, & Cleary, 1969; Weiss & Betz, 1973). Although applicable to paper-and-pencil printed group tests, sequential testing is particularly well suited for computer testing. Essentially, the sequence of items or item groups within the test is determined by the examinee's own performance. For example, everyone might begin with a set of items of intermediate difficulty. Those who score poorly are routed to easier items; those who score well, to more difficult items. Such branching may occur repeatedly at several stages. The principal effect is that each examinee attempts only those items suited to his ability level, rather than trying all items. Sequential testing models will be discussed further in Chapter 11, in connection with the utilization of computers in group testing.
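The branching logic just described can be sketched directly in code; the item pools, routing thresholds, and scoring rule below are all hypothetical:

# Hypothetical item pools of increasing difficulty
ITEM_POOLS = {
    "easy": ["e1", "e2", "e3", "e4", "e5"],
    "medium": ["m1", "m2", "m3", "m4", "m5"],
    "hard": ["h1", "h2", "h3", "h4", "h5"],
}

def branched_test(answer_correctly):
    # Everyone begins with items of intermediate difficulty
    first_stage = sum(answer_correctly(item) for item in ITEM_POOLS["medium"])
    # Route upward or downward on the basis of first-stage performance
    if first_stage >= 4:
        pool = "hard"
    elif first_stage <= 1:
        pool = "easy"
    else:
        pool = "medium"
    second_stage = sum(answer_correctly(item) for item in ITEM_POOLS[pool])
    return pool, first_stage + second_stage

# An examinee who answers every item correctly is routed to the hard items
print(branched_test(lambda item: 1))  # ('hard', 10)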
Another strategy, suitable for the diagnosis of psychological disorders, is to use only two categories, but to test further all cases classified as positives (i.e., possibly pathological) by the preliminary screening test. This is the strategy cited earlier in this section, in connection with the use of tests to diagnose pathological conditions with very low base rates.

It should also be noted that many personnel decisions are in effect sequential, although they may not be so perceived. Incompetent employees hired because of prediction errors can usually be discharged after a probationary period; failing students can be dropped from college at several stages. In such situations, it is only adverse selection decisions that are terminal. To be sure, incorrect selection decisions that are later rectified may be costly in terms of several value systems. But they are often less costly than terminal wrong decisions.

A second condition that may alter the effectiveness of a psychological test is the availability of alternative treatments and the possibility of adapting treatments to individual characteristics. An example would be the utilization of different training procedures for workers at different aptitude levels, or the introduction of compensatory educational programs for students with certain educational disabilities. Under these conditions, the decision strategy followed in individual cases should take into account available data on the interaction of initial test score and differential treatment. When adaptive treatments are utilized, the success rate is likely to be substantially improved. Because the assignment of individuals to alternative treatments is essentially a classification rather than a selection problem, more will be said about the required methodology in a later section on classification decisions.

The examples cited illustrate a few of the ways in which the concepts and rationale of decision theory can assist in the evaluation of psychological tests for specific testing purposes. Essentially, decision theory has served to focus attention on the complexity of factors that determine the contribution a given test can make in a particular situation. The validity coefficient alone cannot indicate whether or not a test should be used, since it is only one of the factors to be considered in evaluating the impact of the test on the efficacy of the total decision process.³

³ For a fuller discussion of the implications of decision theory for test use, see J. S. Wiggins (1973), Ch. 6, and, at a more technical level, Cronbach and Gleser (1965).

DIFFERENTIALLY PREDICTABLE SUBSETS OF PERSONS. The validity of a test for a given criterion may vary among subgroups differing in personal characteristics. The classic psychometric model assumes that prediction errors are characteristic of the test rather than of the person and that these errors are randomly distributed among persons. With the flexibility of approach ushered in by decision theory, there has been increasing exploration of prediction models involving interaction between persons and
tests. Such interaction implies that the same test may be a better predictor for certain classes or subsets of persons than it is for others. For example, a given test may be a better predictor of criterion performance for men than for women, or a better predictor for applicants from a lower than for applicants from a higher socioeconomic level. In these examples, sex and socioeconomic level are known as moderator variables, since they moderate the validity of the test (Saunders, 1956).

When computed in a total group, the validity coefficient of a test may be too low to be of much practical value in prediction. But when recomputed in subsets of individuals differing in some identifiable characteristic, validity may be high in one subset and negligible in another. The test could thus be used effectively in making decisions regarding persons in the first group but not in the second. Perhaps another test or some other assessment device could be found that is an effective predictor in the second group.

A moderator variable is some characteristic of persons that makes it possible to predict the predictability of different individuals with a given instrument. It may be a demographic variable, such as sex, age, educational level, or socioeconomic background; or it may be a score on another test. Interests and motivation often function as moderator variables. Thus, if an applicant has little interest in a job, he will probably perform poorly regardless of his scores on relevant aptitude tests. Among such persons, the correlation between aptitude test scores and job performance would be low. For individuals who are interested and highly motivated, on the other hand, the correlation between aptitude test score and job success may be quite high.
test score and job success may be quite high. more reliable. Although the hypothesis was partially confirmed, the re-
lation proved to be more complex than anticipated (Berdie, 1969).
In a different context, there is evidence that self-report personality in-
EMPmlCALEXAMPLESOF MODERATOR VARIABLES. Evidence for the op- ventories may have higher validity for some types of neurotics than for
eration of moderator variables comes from a variety of sources. In a sur- others (Fulkerson, 1959). The characteristic behavior of the two types
vey of several hundred correlation coefficients between ap~tude test tends to make one type careful and accurate in reporting symptoms, the
scores and academic grades, H. G. Seashore (1962) found htgher cor- o~her ~areless and evasive. The individual who is characteristically pre-
relations for women than for men in the large majority of instances. Tht; ClSe and careful about details, who tends to worry about his problems,
same trend was founa in high sChool and college, although the trend was and who uses intellectualization as a primary defense is likely to provide
more pronounced at the coll~ge level. ~he ?~ta do not in.dicate. the a more accurate picture of his emotional difficulties on a self-report in-
reason for this sex difference in the predictabIhty of academlc achieve- ventory than is the impulsive, careless individual who tends to avoid
ment, but it may be interesting to speculate about it in the light of other expressing unpleasant thoughts and emotions and who llses denial as a
known sex differences. If women students in general tend to be more primary defense.
conforming and more inclined to accept the values and standards of the Ghi~elli (1956, 1960a, 1960b, 1963, 1968; Chise~!C Sander~, 1967) has
school situation, theiJ;class achievement will probably devend largely on extenslvely explored the role of moderator variaBles iIl. UidiIstrial situ-
their abilities. If, on the other hand, men students tend to concentrate ations. In a study of taxi drivers (Ghiselli, 1956), the @rrelati~n between
their efforts on those activities (in or out of school) that arouse their an aptitude test and a job-performance criterion in the t6tl;J applicant
individual interests, these interest differences wO..!,Jldintroduce additional sa'ijl~ was only .220. The group was then sorted into tpirds qp the basis
val'ianee-in their......courseachiev~t and would make it more difficult to ~ ~~ ..~ on an occupational interest test. When the validity of the
180 Principles of Psychological Testing ,_~ Validity: Measurement and Interpretation 181

aptitude test was recomputed within the third whose occupational interest level was most appropriate for the job, it rose to .664.

A technique employed by Ghiselli in much of his research consists in finding for each individual the absolute difference (D) between his actual and his predicted criterion scores. The smaller the value of D, the more predictable is the individual's criterion score. A predictability scale is then developed by comparing the item responses of two contrasted subgroups selected on the basis of their D scores. The predictability scale is subsequently applied to a new sample, to identify highly predictable and poorly predictable subgroups, and the validity of the original test is compared in these two subgroups. This approach has shown considerable promise as a means of identifying persons for whom a test will be a good or a poor predictor. An extension of the same procedure has been developed to determine in advance which of two tests will be a better predictor for each individual (Ghiselli, 1960a).
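The index itself is no more than an absolute difference; a minimal sketch with hypothetical actual and predicted scores:

# Ghiselli's D: absolute difference between actual and predicted criterion scores
actual = [6.2, 4.5, 7.1, 5.0, 3.8]
predicted = [5.9, 6.1, 6.8, 3.4, 4.0]

D = [round(abs(a - p), 2) for a, p in zip(actual, predicted)]
print(D)  # small values of D mark the more predictable individuals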
Other investigators (Dunnette, 1972; Hobert & Dunnette, 1967) have argued that Ghiselli's D index, based on the absolute amount of prediction error without regard to direction of error, may obscure important individual differences. Alternative procedures, involving separate analyses of overpredicted and underpredicted cases, have accordingly been proposed.

At this time, the identification and use of moderator variables are still in an exploratory phase. Considerable caution is required to avoid methodological pitfalls (see, e.g., Abrahams & Alf, 1972a, 1972b; Dunnette, 1972; Ghiselli, 1972; Velicer, 1972a, 1972b). The results are usually quite specific to the situations in which they were obtained. And it is important to check the extent to which the use of moderators actually improves the prediction that could be achieved through other, more direct means (Pinder, 1973).

For the prediction of practical criteria, not one but several tests are generally required. Most criteria are complex, the criterion measure depending on a number of different traits. A single test designed to measure such a criterion would thus have to be highly heterogeneous. It has already been pointed out, however, that a relatively homogeneous test, measuring largely a single trait, is more satisfactory because it yields less ambiguous scores (Ch. 5). Hence, it is usually preferable to use a combination of several relatively homogeneous tests, each covering a different aspect of the criterion, rather than a single test consisting of a hodgepodge of many different kinds of items.

When a number of specially selected tests are employed together to predict a single criterion, they are known as a test battery. The chief problem arising in the use of such batteries concerns the way in which scores on the different tests are to be combined in arriving at a decision regarding each individual. The statistical procedures followed for this purpose are of two major types, namely, multiple regression equations and multiple cutoff scores.

When tests are administered in the intensive study of individual cases, as in clinical diagnosis, counseling, or the evaluation of high-level executives, it is a common practice for the examiner to utilize test scores without further statistical analysis. In preparing a case report and in making recommendations, the examiner relies on judgment, past experience, and theoretical rationale to interpret score patterns and integrate findings from different tests. Such clinical use of test scores will be discussed further in Chapter 16.

MULTIPLE REGRESSION EQUATION. The multiple regression equation yields a predicted criterion score for each individual on the basis of his scores on all the tests in the battery. The following regression equation illustrates the application of this technique to predicting a student's achievement in high school mathematics courses from his scores on verbal (V), numerical (N), and reasoning (R) tests:

Mathematics Achievement = .21 V + .21 N + .32 R + 1.35

In this equation, the student's stanine score on each of the three tests is multiplied by the corresponding weight given in the equation. The sum of these products, plus a constant (1.35), gives the student's predicted stanine position in mathematics courses.

Suppose that Bill Jones receives the following stanine scores:

Verbal 6
Numerical 4
Reasoning 8

The estimated mathematics achievement of this student is found as follows:

Math. Achiev. = (.21)(6) + (.21)(4) + (.32)(8) + 1.35 = 6.01

Bill's predicted stanine is approximately 6. It will be recalled (Ch. 4) that a stanine of 5 represents average performance. Bill would thus be expected to do somewhat better than average in mathematics courses. His very superior performance in the reasoning test (R = 8) and his above-average score on the verbal test (V = 6) compensate for his poor score in speed and accuracy of computation (N = 4).
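The same arithmetic, written out as a check:

# The regression equation above, applied to Bill Jones's stanine scores
weights = {"V": 0.21, "N": 0.21, "R": 0.32}
constant = 1.35
bill = {"V": 6, "N": 4, "R": 8}

predicted = sum(weights[t] * bill[t] for t in weights) + constant
print(round(predicted, 2))  # 6.01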
Specific techniques for the computation of regression equations can be
You might also like